1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
|
============================================================
Encapsulating low-level implementation aspects
============================================================
.. contents::
.. sectnum::
Abstract
========
It has always been a major goal of PyPy to not force implementation
decisions. This means that even after the implementation of the
standard interpreter [#]_ has been written we are still able to experiment
with different approaches to memory management or concurrency and to
target wildly different platforms such as the Java Virtual Machine or
a very memory-limited embedded environment.
We do this by allowing the encapsulation of these low level aspects as
well defined parts of the translation process.
In the following document, we give examples of aspects that have been
successfully encapsulated in more detail and contrast the potential of
our approach with CPython.
.. [#] `standard interpreter`_ is our term for the code which
implements the Python language, i.e. the interpreter and the
standard object space.
Background
==========
One of the better known significant modifications to CPython are
Christian Tismer's "stackless" patches [STK]_, which allow for far more
flexible control flow than the typical function call/return supported by
CPython. Originally implemented as a series of invasive patches,
Christian found that maintaining these patches as CPython itself was
further developed was time consuming to the point of no longer being
able to work on the new functionality that was the point of the
exercise.
One solution would have been for the patches to become part of core
CPython but this was not done partly because the code that fully
enabled stackless required widespread modifications that made the code
harder to understand (as the "stackless" model contains control flow
that is not easily expressable in C, the implementation became much
less "natural" in some sense).
With PyPy, however, it is possible to obtain this flexible control
flow whilst retaining transparent implementation code as the necessary
modifications can be implemented as a localized translation aspect,
and indeed this was done at the Paris sprint in a couple of days (as
compared to around six months for the original stackless patches).
Of course, this is not the only aspect that can be so decided a
posteriori, during translation.
Translation aspects
===================
Our standard interpreter is implemented at a very high level of
abstraction. This has a number of happy consequences, among which is
enabling the encapsulation of language aspects as described in this
document. For example, the implementation code simply makes no
reference to memory management, which clealy gives the translator
complete freedom to decide about this aspect. This constrasts with
CPython where the decision to use reference counting is reflected tens
or even hundreds of times in each C source file in the codebase.
As described in [ARCH]_, producing a Python implementation from the
source of our standard interpreter involves various stages: the
initialization code is run, the resulting code is annotated, typed and
finally translated. By the nature of the task, the encapsulation of
*low-level aspects* mainly affects the typer and the translation
process. At the coarsest level, the selection of target platform
involves writing a new backend -- still a significant task, but much
much easier than writing a complete implementation of Python!
Other aspects affect different levels, as their needs require. The
remainder of this section describes a few aspects that we have
successfully enscapsulated.
An advantage of our approach is that any combination of aspects can be
freely selected, avoiding the problem of combinatorial explosion of
variants seen in manually written interpreters.
Stacklessness
-------------
The stackless modifications are mostly implemented in the C backend,
with a single extra flow graph operation to influence some details of
the generated C code. The total changes only required about 300 lines
of source, vindicating our abstract approach.
In stackless mode, the C backend generates functions that are
systematically extended with a small amount of bookkeeping code. This
allows the C code to save its own stack to the heap on demand, where it
can then be inspected, manipulated and eventually resumed. This is
described in more detail in [TA]_. In this way, unlimited (or more
precisely heap-limited) recursion is possible, even on operating systems
that limit the size of the C stack. Alternatively, a different saved
stack can be resumed, thus implementing soft context switches -
coroutines, or green threads with an appropriate scheduler. We reobtain
in this way all the major benefits of the original "stackless" patches.
This effect requires a number of changes in each and every C function
that would be extremely tedious to write by hand: checking for the
signal triggering the saving of the stack, actually saving precisely the
currently active local variables, and when re-entering the function
check which variables are being restored and which call site is resumed.
In addition, a couple of global tables must be maintained to drive the
process. The key point is that we can fine-tune all these interactions
freely, without having to rewrite the whole code all the time but only
modifying the C backend (in addition, of course, to being able to change
at any time the high-level code that is the input of the translation
process). So far, this allowed us to find a style that does not hinder
the optimisations performed by the C compiler and so has only a minor
impact on performance in the normal case.
Also note that the fact that the C stack can be fully saved into the
heap can tremendously simplify the portable implementation of garbage
collection: after the stack has been completely transfered to the heap,
there are no roots left on the stack.
Multiple Interpreters
---------------------
Another implementation detail that causes tension between functionality
and both code clarity and memory consumption in CPython is the issue of
multiple independent interpreters in the same process. In CPython there
is a partial implementation of this idea in the "interpreter state" API,
but the interpreters produced by this are not truly independent -- for
instance the dictionary that contains interned strings is implemented as
file-level static object, and is thus shared between the interpreters.
A full implementation of this idea would entirely eschew the use of file
level statics and place all interpreter-global data in some large
structure, which would hamper readability and maintainability. In
addition, in many situations it is necessary to determine which
interpreter a given object is "from" -- and this is not possible in
CPython largely because of the memory overhead that adding a 'interp'
pointer to all Python objects would create.
In PyPy, all of our implementation code manipulates an explicit object
space instance, as described in [CODG]_. The situation of multiple
interpreters is thus handled automatically: if there is only one space
instance, it is regarded as a pre-constructed constant and the space
object pointer (though not its non-constant contents) disappears from
the produced source, i.e. from function arguments, local variables and
instance fields. If there are two or more such instances, a 'space'
attribute will be automatically added to all application objects (or
more precisely, it will not be removed by the translation process), the
best of both worlds.
Memory Management
-----------------
As mentioned above, CPython's decision to use a garbage collector based
on reference counting is reflected throughout its source. In the
implementation code of PyPy, it is not, and in fact the standard
interpreter can currently be compiled to use a reference counted scheme
or the Boehm GC [BOEHM]_. Again, more details are in [TA]_. We also
have an experimental framework for developing custom exact GCs [GC]_,
but it is not yet integrated with the low-level translation back-ends.
Another advantage of the aspect oriented approach shows itself most
clearly with this memory management aspect: that of correctness.
Although reference counting is a fairly simple scheme, writing code for
CPython requires that the programmer make a large number of
not-quite-trivial decisions about the refcounting code. Experience
suggests that mistakes will always creep in, leading to crashes or
leaks. While tools exist to help find these mistakes, it is surely
better to not write the reference count manipulations at all and this is
what PyPy's approach allows. Writing the code that emits the correct
reference count manipulations is surely harder than writing any one
piece of explicit refcounting code, but once it is done and tested, it
just works without further effort.
Concurrency
-----------
The aspect of CPython's implementation that has probably caused more
discussion than any other mentioned here is that of the threading
model. Python has supported threads since version 1.5 with what is
commonly referred to as the "Global Interpreter Lock" or GIL; the
execution of bytecodes is serialized such that only one thread can be
executing Python code at one time. This has the benefit of being
relatively unintrusive and not too complex, but has the disadvantage
that multi-threaded, computation-bound Python code does not gain
performance on multi-processor machines.
PyPy will offer the opportunity to experiment with different models,
although currently we only offer a version with no thread support and
another with a GIL-like model [TA]_. (We also plan to support soon
"green" software-only threads in the Stackless model described above,
but obviously this would not solve the multi-processor scalability
issue.)
The future work in this direction is to collect the numerous possible
approaches that have between thought out along the years and
e.g. presented on the CPython development mailing list. Most of them
have never been tried out in CPython, for lack of necessary resources.
A number of them are clearly easy to try out in PyPy, at least in an
experimental version that would allow its costs to be assessed -- for
example, various forms of object-level locking.
Evaluation Strategy
-------------------
Possibly the most radical aspect to tinker with is the evaluation
strategy. The thunk object space [OBJS]_ wraps the standard object
space to allow the production of "lazily computed objects", i.e. objects
whose values are only calculated when needed. It also allows global and
total replacement of one object with another.
The thunk object space is mostly meant as an example of what our
approach can acheive -- the combination of side-effects and lazy
evaluation is not easy to understand. This demonstration is important
because this level of flexibility will be required to implement future
features along the lines of Prolog-style logic variables, transparent
persistency, object distribution across several machines, or
object-level security.
Experimental results
====================
All the aspects described in the previous chapter have been successfully
implemented and are available since the release 0.7 or 0.8 of PyPy.
We have conducted preliminary experimental measures of the performance
impact of enabling each of these features in the compiled PyPy
interpreter. We present below the current results as of October 2005.
Most figures appear to vary from machine to machine. Given that the
generated code is large (it produce a binary of 5.6MB on a Linux
Pentium), there might be locality and code ordering issues that cause
important cache effects.
We have not particularly optimised any of these aspects yet. Our goal
is primarily to prove that the whole approach is worthwhile, and rely on
future work and push for external contributions to implement
state-of-the-art techniques in each of these domains.
Stacklessness
Producing Stackless-style C code means that all the functions of the
PyPy interpreter that can be involved in recursions contain stack
bookkeeping code (leaf functions, functions calling only leaves,
etc. do not need to use this style). The current performance impact
is to make PyPy slower by about 8%. A couple of minor pending
optimisations could reduce this figure a bit. We expect the rest of
the performance impact to be mainly caused by the increase of size
of the generated executable (+28%).
Multiple Interpreters
A binary that allowed selection between two copies of the standard
object space with a command line switch was about 10% slower and
about 40% larger than the default. Most of the extra size is
likely accounted for by the duplication of the large amount of
prebuilt data involved in an instance of the standard object
space.
Memory Management
The [Boehm] GC is well-optimised and produces excellent results. By
comparison, using reference counting instead makes the interpreter
twice as slow. This is almost certainly due to the naive approach
to reference counting used so far, which updates the counter far
more often than strictly necessary; we also still have a lot of
objects that would theoretically not need a reference counter,
either because they are short-lived or because we can prove that
they are "owned" by another object and can share its lifetime. In
the long run, it will be interesting to see how far this figure can
be reduced, given past experiences with CPython which seem to show
that reference counting is a viable idea for Python interpreters.
Concurrency
No experimental data available so far. Just enabling threads
currently creates an overhead that hides the real costs of locking.
Evaluation Strategy
When translated to C code, the Thunk object space has a global
performance impact of 6%. The executable is 13% bigger (probably
due to the arguably excessive inlining we perform).
We have described five aspects in this document, each currently with
two implementation choices, leading to 32 possible translations. We
have not measured the performance of each variant, but the few we have
tried suggests that the performance impacts are what one would expect,
e.g. a translated stackless binary using the thunk object space would
be expected to be about 1.06 x 1.08 ~= 1.14 times slower than the
default and was found to be 1.15 times slower.
Conclusion
==========
Although still a work in progress, we believe that the successes we
have had in enscapsulating implementation aspects justifies the
approach we have taken. In particular, the relative ease of
implementing the translation aspects described in this paper -- as
mentioned above, the stackless modifications took only a few days --
means we are confident that it will be easily possible to encapsulate
implementation aspects we have not yet considered.
References
==========
.. [ARCH] `Architecture Overview`_, PyPy documentation, 2003-2005
.. [BOEHM] `Boehm-Demers-Weiser garbage collector`_, a garbage collector
for C and C++, Hans Boehm, 1988-2004
.. [CODG] `Coding Guide`_, PyPy documentation, 2003-2005
.. [GC] `Garbage Collection`_, PyPy documentation, 2005
.. [OBJS] `Object Spaces`_, PyPy documentation, 2003-2005
.. [STK] `Stackless Python`_, a Python implementation that does not use
the C stack, Christian Tismer, 1999-2004
.. [TA] `Memory management and threading models as translation aspects`_,
PyPy documentation (and EU Deliverable D05.3), 2005
.. _`standard interpreter`: architecture.html#standard-interpreter
.. _`Architecture Overview`: architecture.html
.. _`Coding Guide`: coding-guide.html
.. _`Garbage Collection`: garbage_collection.html
.. _`Object Spaces`: objspace.html
.. _`Stackless Python`: http://www.stackless.com
.. _`Boehm-Demers-Weiser garbage collector`: http://www.hpl.hp.com/personal/Hans_Boehm/gc/
.. _`Memory management and threading models as translation aspects`: translation-aspects.html
|