With the adoption of async/await syntax, modern Python has seen the emergence of coroutine-based asynchronous programming. Frameworks such as the standard library's asyncio, Trio, and Dave Beazley's Curio provide event loop implementations and high-level APIs for running coroutines, spawning tasks, and synchronizing between them. Nowadays, many see async Python as the de facto standard approach to writing high-performance network-bound code, such as web servers and database interface libraries. This includes us at Applifting, and since sharing know-how is part of our culture, we are always eager to talk about our experience and practices. In this article, we will discuss the upcoming addition of task groups to Python’s asyncio and how they help us write resilient and maintainable concurrent code at scale.
Signull case study
At Applifting, we have chosen Python to develop the backend for Signull, a cryptocurrency market analysis tool for powertraders. The project faced a number of technical challenges. In the initial phases, the product team was navigating the uncharted and ever-changing crypto domain and looking to shape an MVP for user validation. Developers needed to make swift deliveries and continuously iterate on new ideas, making it difficult to lay a solid architectural foundation. We were experimenting with various data sources, designing and deprecating worker services on a weekly basis, and looking for ways to ingest live price data for tens of thousands of instruments with minimal latency. The network-bound nature of most technical problems made Python an attractive choice; the concurrency model adopted by asyncio fit the bill nicely.
Signull’s data ingress operates at the scale of hundreds of HTTP requests per second, all the while retrying failed requests, respecting variable rate limits, and synchronizing responses with data fed over not-always-reliable WebSocket connections. Some workers run in multiple replicas to sustain the rate of ingestion, depending on Redis and RabbitMQ for synchronization. As the system scaled, we realized that the product's success would depend on sound usage of synchronization mechanisms, re-entrancy, and resilience in the face of network issues and unreliable data providers.
We learned many valuable lessons on this journey. One of them is that concurrency at scale desperately needs, yet often lacks, strict and enforceable structure. Before we dive into the details of what this means in practice, let us recap Python’s concurrency model.
Coroutine-based concurrency
Coroutines can be understood as an alternative concurrency model to shared-state threading (whether OS-native or not). In the Python community, the threading module is often dismissed as inadequate or even pointless due to the notorious CPython GIL (although there are valid reasons for its existence). GIL aside, however, multithreading as an implementation-agnostic concept is still burdened by a number of issues. In a system with preemptive scheduling and arbitrary concurrent execution, local reasoning becomes significantly more difficult and error-prone. Developers must introduce mutexes and other synchronization mechanisms to protect against race conditions, but the correctness of such mitigations is difficult to verify and must be reconsidered whenever the code is modified or even just called.
"You have to have a level of vigilance bordering on paranoia just to make sure that your conventions around where state can be manipulated and by whom are honoured, because when such an interaction causes a bug it’s nearly impossible to tell where it came from."
Coroutines differ from threads in that they implement cooperative multitasking: they must yield control or suspend explicitly (e.g. via a yield or await expression). This means that the programmer is always aware of a potential context switch and is able to arrange a graceful and safe suspension. Glyph compares this sort of expression to a relief valve: a single clearly marked point where we have to consider the implications of a potential transfer of control. As such, coroutines can be thought of as a semantic improvement over threads.
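To make the "relief valve" idea concrete, here is a minimal sketch using asyncio. The worker functions and the shared `state` dictionary are hypothetical names chosen for illustration. The only difference between the two workers is where the explicit suspension point sits relative to a read-modify-write on shared state.

```python
import asyncio


async def atomic_increment(state):
    # No await between the read and the write, so no other coroutine
    # can run in between: the update cannot be interleaved.
    value = state["count"]
    state["count"] = value + 1
    await asyncio.sleep(0)  # explicit suspension point, after the update


async def interleaved_increment(state):
    value = state["count"]
    await asyncio.sleep(0)  # suspension point between the read and the write
    state["count"] = value + 1  # may overwrite updates made by other coroutines


async def run_workers(worker, n=100):
    # Run n workers concurrently against one shared counter.
    state = {"count": 0}
    await asyncio.gather(*(worker(state) for _ in range(n)))
    return state["count"]


atomic_total = asyncio.run(run_workers(atomic_increment))
interleaved_total = asyncio.run(run_workers(interleaved_increment))
```

In this sketch, `atomic_total` equals the number of workers, while `interleaved_total` falls short because updates are lost at the marked suspension point. The hazard is still there, but it is confined to a place that is visible in the source code.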
The problem of runaway tasks
Despite the convenience of coroutine-based concurrency, Python's asyncio module has long lacked an intuitive and convenient way to manage groups of concurrently running tasks. The current API revolves around create_task, which returns a task handle to the user. The user is then responsible for keeping references to running tasks, collecting return values, and handling safe cancellation in case of errors. This is notoriously difficult and error-prone. Without correct task management, we end up with runaway tasks, which never get awaited by the parent or checked for exceptions. As a result, the program can easily end up in an invalid state while failing to emit any kind of error or warning.
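As a rough illustration of the bookkeeping this implies, the sketch below shows one possible way to run a group of tasks by hand with create_task and gather. The `fetch` worker, the `run_all` helper, and the `cancelled` list are hypothetical and exist only for this example; this is one pattern under those assumptions, not the only way to do it.

```python
import asyncio

cancelled = []  # names of workers that were cancelled (for illustration only)


async def fetch(name, delay, fail=False):
    # Hypothetical worker: sleeps, then either fails or returns its name.
    try:
        await asyncio.sleep(delay)
    except asyncio.CancelledError:
        cancelled.append(name)
        raise
    if fail:
        raise RuntimeError(f"{name} failed")
    return name


async def run_all(specs):
    # Keep a reference to every task so that none of them is lost.
    tasks = [asyncio.create_task(fetch(*spec)) for spec in specs]
    try:
        return await asyncio.gather(*tasks)
    except BaseException:
        # gather() does not cancel the siblings of a failed task,
        # so cancel them manually and wait until they have finished.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        raise


async def demo():
    results = await run_all([("a", 0.01), ("b", 0.02)])
    try:
        await run_all([("slow", 10), ("bad", 0.01, True)])
    except RuntimeError as exc:
        error = str(exc)
    return results, error


results, error = asyncio.run(demo())
```

Every call site that spawns tasks has to repeat some variation of this try/except dance, and forgetting any part of it (the references, the cancellation, or the final wait) reintroduces the problem.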
Consider the following code: