General CS Notes

Python

Data Structures, Internals, Advanced Use

I’ll be assuming CPython throughout.

TODO:

bytes

what        for
bytes       Immutable container w/ len(). Idiomatic build from list[bytes] chunks: b''.join(chunks)
bytearray   Supports mutation: a[7] = 0xf4
io.BytesIO  File-like interface (read/write/seek)
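
A quick sketch of the three in action (the byte contents here are made up for illustration):

import io

# bytes: immutable; build from chunks with join rather than += in a loop
chunks = [b"GIF8", b"9a", b"\x00\x01\x00\x01"]
blob = b"".join(chunks)
len(blob)         # => 10
# blob[0] = 0x00  # TypeError: 'bytes' object does not support item assignment

# bytearray: supports in-place mutation
buf = bytearray(blob)
buf[7] = 0xf4
buf += b"\x00"

# io.BytesIO: file-like interface over an in-memory buffer
f = io.BytesIO(blob)
f.read(4)         # => b'GIF8'
f.seek(0)
f.write(b"XXXX")
f.getvalue()[:4]  # => b'XXXX'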

list

Python’s list is a plain old dynamic array.

Basic implications: deleting anywhere but the end incurs an O(n) cost.
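
A quick, rough way to see this (the absolute numbers are machine-dependent; only the relative difference matters):

import timeit

setup = "lst = list(range(100_000))"
# pop from the end: O(1); pop from the front: shifts every remaining element, O(n)
timeit.timeit("lst.append(0); lst.pop()", setup=setup, number=1_000)
timeit.timeit("lst.append(0); lst.pop(0)", setup=setup, number=1_000)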

Some interesting notes:

Deep-Dive: List Source Code

Source https://github.com/python/cpython/blob/main/Objects/listobject.c

Quite readable. Observations from the code overall:

  • wow there’s a lot of code protected by #ifdef Py_GIL_DISABLED, I bet a lot of maintainers hated this
  • garbage collection, allocation, and freeing take up a bunch of the other code implementing operations
  • the rest is pretty straightforward, readable code implementing operations as you’d expect
  • they use goto for errors
  • it’s striking that if you were to implement a language in C with garbage collection and objects, this is probably what it’d look like. I didn’t expect that. I guess I thought there’d be much more “opinionated” code or something. But instead, it feels like, if you had very similar goals to Python, you’d end up implementing something that looks a lot like Python’s source code.
  • there’s a bunch of Py_BEGIN_CRITICAL_SECTION(...)/Py_END_CRITICAL_SECTION(), I’m curious what these do.
  • the list uses “allocated” for capacity and “size” for length.
  • I forgot about the joy of C, where you can’t tell from the #includes where stuff comes from. E.g., PyMem_Malloc, where are you from? (Editor’s note: see Python Memory Management if curious.)
  • checking the blame of the source code is wild
    • nearly every line has been changed
    • a huge number of lines were most recently changed by the GIL removal effort
    • whole blocks of design comments are from Guido’s “Initial revision” commit 35 years ago

Here’s an interesting comment on allocations, especially since I was just looking up how the amortized time for growing dynamically-sized arrays depends on the growth strategy:

This over-allocates proportional to the list size, making room for additional growth. The over-allocation is mild, but is enough to give linear-time amortized behavior over a long sequence of appends() in the presence of a poorly-performing system realloc(). Add padding to make the allocated size multiple of 4. The growth pattern is: 0, 4, 8, 16, 24, 32, 40, 52, 64, 76, …

– source

Linear-time amortized behavior! From the phrasing, it sounds like linear-time here is a good result. I would have expected it to be bad, in contrast with constant time (from doubling). However, I’m sure it’s a tradeoff with memory inefficiency, and likely there’s better performance with a good system realloc(). (I’d guess best-case realloc is that it has great performance, even O(1), if it can either directly expand the memory region, or copy (map?) the old contents over very quickly.)

Update: the phrasing here confused me, but I think it’s saying that it’s amortized linear time O(n) for n appends, i.e., it’s amortized O(1) per append. This checks out from an (arguing with LLMs) analysis: as long as we grow by a constant factor (here roughly 9/8), the copying costs form a geometric series, and we amortize out to constant time per append.
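
To convince myself, here’s a little simulation. The growth rule is my paraphrase of list_resize() in listobject.c (roughly newsize + newsize/8 + 6, rounded down to a multiple of 4; treat the exact constants as approximate), and I pessimistically assume every resize is allocate-new-plus-copy:

def new_allocated(newsize):
    # paraphrase of the over-allocation rule in list_resize() (constants approximate)
    if newsize == 0:
        return 0
    return (newsize + (newsize >> 3) + 6) & ~3

# reproduce the growth pattern from the comment above
caps, cap = [], 0
for n in range(1, 81):
    if n > cap:
        cap = new_allocated(n)
        caps.append(cap)
caps  # => [4, 8, 16, 24, 32, 40, 52, 64, 76, 92] -- the comment's pattern, minus the leading 0

# count element copies over N appends, assuming each resize copies everything
N = 1_000_000
copies, cap, length = 0, 0, 0
for _ in range(N):
    if length == cap:
        cap = new_allocated(length + 1)
        copies += length  # old contents get copied into the new region
    length += 1
copies / N  # => ~8, i.e., a constant number of copies per append -> amortized O(1)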

This also makes me wonder: is resizing an array actually often cheaper than O(n), because reallocating might have O(1) performance? I could imagine this straying from amortized analysis to average-case analysis, because I’d guess whether reallocation can trivially expand the same region depends on the sizes of the old and new regions. E.g., I bet smaller reallocations are more likely to succeed without copying than larger ones due to fragmentation.

Update: Surprisingly, despite realloc seeming like a really efficient choice, it seems unlikely it’s used. Python’s newer memory manager mimalloc implements reallocation by allocating and copying. The core lines are:

// from _mi_heap_realloc_zero(...)
void* newp = mi_heap_malloc(heap,newsize);
const size_t copysize = (newsize > size ? size : newsize);
_mi_memcpy(newp, p, copysize);
mi_free(p);

I think this may be because custom memory managers (mimalloc replaces malloc/realloc/etc.) use mmap/munmap as the underlying memory primitive, and while mremap exists as an mmap-flavored alternative to realloc, it’s Linux-only, so it might not be relied on for a general-purpose cross-system library.

This all means that resizing really is going to be O(n) rather than O(1).

I would have expected, with all the layers of allocators, that something like a list would draw from a pool with ready-to-expand memory regions. But it seems like a list goes straight to the “raw” memory API.

The code references “system realloc()”, but interestingly calls PyMem_Realloc. This makes sense because Python is multi-OS, so they’d need to abstract over system calls. I’m curious how much of their own memory management they do. Like whether they ask for large chunks of memory from the system and then manage it themselves with various pools, or if they just allocate on demand.

Update: The former. (But also, they do allocate on demand!) Deeper dive in Python Memory Management

One other question is: where is the length (size) kept? The list C data structure looks to simply be:

// listobject.c
typedef struct {
    Py_ssize_t allocated;
    PyObject *ob_item[];
} _PyListArray;

allocated is the capacity, and then there’s the backing array of pointers ob_item. But where’s the length (i.e., the count of actually filled elements)?

… well, that definition and its usage are always within #ifdef Py_GIL_DISABLED guards. So I think that’s the “no GIL” version. Let’s instead trace the standard version, which is found in the header file:

// listobject.h
typedef struct {
    PyObject_VAR_HEAD
    /* Vector of pointers to list elements.  list[0] is ob_item[0], etc. */
    PyObject **ob_item;

    /* ob_item contains space for 'allocated' elements.  The number
     * currently in use is ob_size.
     * Invariants:
     *     0 <= ob_size <= allocated
     *     len(list) == ob_size
     *     ob_item == NULL implies ob_size == allocated == 0
     * list.sort() temporarily sets allocated to -1 to detect mutations.
     *
     * Items must normally not be NULL, except during construction when
     * the list is not yet visible outside the function that builds it.
     */
    Py_ssize_t allocated;
} PyListObject;

What!? Same problem! We have allocated and then the backing array of pointers ob_item. The magic is in PyObject_VAR_HEAD. I think this always expands to a member of type PyVarObject named ob_base. A field in the PyVarObject struct is ob_size, which contains the length.

// object.h

/* PyObject_VAR_HEAD defines the initial segment of all variable-size
 * container objects. ... */
#define PyObject_VAR_HEAD      PyVarObject ob_base;

struct PyVarObject {
    PyObject ob_base;
    Py_ssize_t ob_size; /* Number of items in variable part */
};
typedef struct PyVarObject PyVarObject;
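
As a sanity check, here’s a ctypes peek at a live list. The field layout is my hand-transcription of PyObject + PyVarObject + PyListObject for a typical 64-bit, non-free-threaded build, so treat it as a sketch of an implementation detail, not an API:

import ctypes

class RawList(ctypes.Structure):
    # hand-transcribed layout: PyObject head, then PyVarObject's ob_size,
    # then PyListObject's ob_item and allocated
    # (assumes a typical 64-bit, non-free-threaded CPython build)
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("ob_size", ctypes.c_ssize_t),    # length, from PyVarObject
        ("ob_item", ctypes.c_void_p),     # PyObject **
        ("allocated", ctypes.c_ssize_t),  # capacity
    ]

lst = []
for i in range(10):
    lst.append(i)

raw = RawList.from_address(id(lst))  # id() is the object's address in CPython
(len(lst), raw.ob_size, raw.allocated)  # => (10, 10, 16) on my build; 16 matches the growth pattern above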

id()

Small aside, but id() actually gives the object’s C pointer. This is cool. Some notes on this:
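
For one, ctypes will happily turn that integer back into the object (CPython-specific, and only safe while the object is alive):

import ctypes

x = ["anything"]
ctypes.cast(id(x), ctypes.py_object).value is x  # => True: id() really is the pointer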

yield

The presence of yield makes a function, when called, actually return a Generator g instead of running its body.

Using the Generator g:

def foo():
    for i in range(5):
        yield i
        yield i * 2  # no problem! yielded next time
        if i == 2:
            return  # raises StopIteration early

g = foo()
list(g)  # same as: list(foo())
# => [0, 0, 1, 2, 2, 4]

Example 1: multiple yields, and return early.

def infinite():
    n = -1
    while True:
        yield n
        n -= 1

g = infinite()
[next(g) for _ in range(5)]
# => [-1, -2, -3, -4, -5]

Example 2: infinite generator (counts down from -1).

def squarer():
    received = 0  # initial
    while True:
        received = yield received**2

g = squarer()

# priming: two identical options:
next(g)         # 1. returns 0 (we discard)
# g.send(None)  # 2. returns 0 (we discard)

[g.send(x) for x in range(1, 6)]
# => [1, 4, 9, 16, 25]

Example 3: send() values in after priming.
