JEREMY MANSON: I know you've all come here to see ***, of
course, but I wanted to first say that this is the latest in
our series of talks about programming
languages topics at Google.
The goal of this series of talks is to have everybody who
knows something about programming languages that
Googlers in general don't know come up here and
give a series of talks.
Obviously, we're very lucky here today to have *** Van
Rossum, the Benevolent Dictator for Life of Python.
But you don't have to be *** Van Rossum to give this talk.
To give this talk, you do.
But to give a talk at the series, you don't have to be.
So please, if you have ideas for talks, if you want to give
a talk, come up and see me.
My name is-- or email me-- my name is Jeremy Manson.
And again--
OK, so now I should move on to the actual
meat of the talk here.
Again, *** Van Rossum is the Benevolent
Dictator for Life of Python.
He is the creator and father of Python, and we're very,
very lucky to have him able to give a talk to us here today.
And here he is.
*** VAN ROSSUM: Thank you.
Jeremy.
Thanks for giving me the opportunity to do a preview of
my talk, which is going to be a keynote at the Python
conference next week.
Reminder for Googlers--
we're going to put this up on Google Video, so please don't
ask any Google-sensitive questions.
Quick overview of what I hope to be talking about, and I'll
make sure that we will not skip the last two bullets--
"What You Can Do Today" and "Questions."
What happened since last summer--
I mean, I started giving Python 3000 talks--
well, I really started giving Python 3000 talks about seven
years ago in 2000.
For a very long time it was purely a daydream.
It was purely conceptual.
It was going to be the next big thing.
Early last year, we decided to really go make an effort, fix
a set of features, and actually start implementing.
And so gradually over last year the plans became more
solid and we had various revisions of the schedule.
I'll give a little bit of a timeline.
I'll give some highlights.
If you have to leave in 10 minutes, stay until the
highlights slide is finished.
Then I'm going to have a long laundry list of various things
that will definitely, or most likely, or in one or two cases
potentially make it into this new release.
And the things that are sort of most interesting from the
developer's and the end user's perspective.
I'll try to say a bit about how you turn your Python 2
code into Python 3 code, which is not completely trivial, but
also doesn't have to be a tedious,
completely manual process.
And I'll start giving some hints.
And over the next six to nine months those hints will
probably improve in quality on what you can do to your code
today to be ready for Python 3000.
Basically to make the final
transition as easy as possible.
So we started with having lots of discussions.
At some point I actually had to say, we've had enough
discussions.
Let's get down to implementation work.
And I had to say that several times, and I think the last
time I said it was around Christmas 2006.
Since then, it's really been very much nose to the
grindstone, work out the details on features that we
know we're going to have, and work on the implementation.
And sometimes the implementation actually
informs the specification as things go.
We did write quite a few PEPs--
still not enough, in my view.
And I think we're pretty much on schedule in terms of
writing code.
But it's certainly going to be a big effort
between now and June.
Which takes us to the "Timeline" slide.
I hope that by April this year we'll really be done with the
sort of feature proposal process.
And the feature selection process should be done soon
after that, because that typically goes hand in hand.
We don't collect all the proposals and then there's a
long pause where somebody selects them.
We discuss them as they are being proposed, and as they
are sort of finalized, that also means they are accepted.
So I hope to be able to complete that by April.
Then by June I hope to actually have
a first alpha release.
Then I'm giving myself and the developers another year to
sort of work through the feedback, shave off the sharp
bits, improve performance.
Because at the moment, we're really feature-driven and
performance sometimes goes by the wayside.
Increasingly get users to actually try the new Python
with their source code, with their applications.
And then hopefully in 2008, in June, or somewhere in the
middle of next year we'll actually have a release that
we can be happy with.
That doesn't mean that at that point, everyone who is using
2.x will be forced to upgrade.
There is going to be a Python 2.6 release actually somewhat
earlier than the planned Python 3.0 release.
Although you never know.
Releases tend to sort of fluctuate a bit.
2.6 is the first release that is going to make an active
effort to also incorporate things that will help you
transition to Python 3000.
It will, in some cases, have options that turn on warnings
for things that are going to disappear.
And in some cases features from Python 3000 will actually
be backported into 2.6.
And unless the transition goes really smoothly for everyone
immediately, it's very likely that there will also be a 2.7
release at the usual schedule for the 2.x releases.
So the highlights--
and I have more slides on each of these-- but print is going
to be a function.
That is sort of--
I just implemented that last week, and I really have to get
used to it still.
But it is the right thing to do.
Dictionary views are even fresher.
Oh, by the way-- a single star means that there is some
working code, but it's not complete, and two stars means
that it's currently completely vaporware, but we know we're
going to do this.
Question marks means that it's not just vaporware, but we're
also not sure that we're going to do that.
But there aren't any question marks with this slide.
Dictionary views is another thing that will impact many
people's code.
Basically, the keys and items and values methods will
return something that smells like a set
rather than like a list.
Comparing objects has also changed.
At least the default comparison will sort of be
more type-safe and less lenient.
Probably one of the biggest things certainly in terms of
implementation is unicode.
We're going to move to a more Java-like model where all
strings are unicode.
And we have a separate bytes data type, which is more like
an array of small integers than like a string.
That means that we also have to implement a new I/O
library, which I'm actually pretty optimistic about.
Some things have already been done in integer unification,
which means that there's only one integer type.
There's no more long--
no more long literals.
And you can get pretty close to that even
today in Python 2.4.
In 2.4 you almost never have to cast
things to long anymore.
And you don't have to cast them back from long to int
because most of those conversions are taken care of
by the system.
In 3000, the long type will completely disappear.
Integer division will return float.
That's been a longstanding wish of mine.
Actually you can turn that on in Python 2--
since, I think probably since Python 2.1 or 2.0 even.
You could turn--
2.1?
Thomas knows everything.
But not too many people use it.
And then, of course, there's lots of other cleanups like
string exceptions no longer exist, classic classes no
longer exist, we're changing the raise statement, and so on
and so forth.
So a little bit more on many of those items and a bunch
that didn't make it to the highlights page.
Print is a function.
We had a discussion and there were a couple
of competing proposals.
One of the proposals was that if we were going to make it a
function, we should also drastically change what it does.
Maybe not insert spaces between items, maybe have
printf functionality.
In the end, we decided to actually go with a very simple
transformation where we have a print function that is just as
convenient as the print statement is currently.
So in most cases, all you have to do is put
parentheses around it.
By the way, you won't have to edit your code yourself.
We have a conversion tool, and while the conversion tool is
far from perfect, this is one of the things it
can do really well.
There's this funny business with a trailing comma that
suppresses the trailing newline.
You can simulate that by--
the print function will have three different keyword arguments.
It will have end, which is the character that is output at
the end of the list of arguments, and which
defaults to a newline.
There's sep, which is the thing that gets output in between
items, which defaults to a space.
And there is file, which is the file where it's going to be
printed to, which defaults to whatever sys.stdout
is at the moment.
So these three forms of print syntax all translate to very
straightforward calls to the print function.
And there's some functionality that you can't easily do with
a print statement at the moment that you can do by
setting, for example, the sep keyword to an empty string.
We can automatically translate this.
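As a minimal sketch of how those keyword arguments might be used, assuming the sep, end, and file spellings described above (the printed values are invented for illustration):

    import sys
    print("fee", "fie", "fo", "fum")        # items separated by spaces, newline at the end
    print("no newline here", end="")        # end="" suppresses the trailing newline
    print("a", "b", "c", sep="")            # sep="" joins the items with nothing in between
    print("something went wrong", file=sys.stderr)   # send the output somewhere else
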
The only place where that fails--
it turns out the print statement has a couple of bits
of cleverness where it works with an attribute on the output
file named softspace, which is mostly hidden.
But it's actually accessible to end users if
you really want to.
And the softspace attribute is used to delay the outputting
of the space between items until you actually know that
you have the next item.
That is pretty murky semantics, and it means that
everybody who implements a file-like object at some point
finds that they also have to support the softspace feature.
So I decided to just get rid of that.
It does mean that there are a few corner cases, like if you
print a string that ends in either a newline or a tab
character, and then comma and another item, the current
print is cleverly suppressing the space
between the two items.
The print function will intentionally be slightly
dumber about that.
So when converting the standard library and the standard
unit tests, I think I had maybe five cases where I had to
fix this manually in the code.
And usually it's very straightforward.
So dictionary views.
This has a star because the dictionary views currently,
while implemented, don't quite behave like set objects yet.
They can be compared to set objects, but they can't quite
implement--
they don't quite implement all the operations that you expect
of set objects like union and intersection.
They do sort of have the basic functionality.
You can iterate over them, you can do a membership check,
and you can compare them to another set for equality,
which is actually a relatively big deal.
In the past, if you wanted to see if two dictionaries had
the same set of keys, you would have to make a copy of
each dictionary's keys into a list and then sort the lists.
Or make copies into sets if you were sort of using a more
recent version of Python, like 2.3 which has a sets module,
or 2.4 which has sets as a built-in type.
And then you had to-- you could compare those two sets
or those two sorted lists.
The problem with that is that if you have a large
dictionary, you end up making a large copy of all the keys.
What you can do with the keys' view is actually, you can just
compare the two keys and because they act as sets, it
will automatically and efficiently compare whether
the two sets have the same elements, whether one is a
subset of the other and vice versa.
Just a mathematical definition of set equality.
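A small sketch of the comparison being described, assuming keys() returns the set-like view discussed here (the dictionaries are invented for illustration):

    d1 = {"a": 1, "b": 2}
    d2 = {"b": 20, "a": 10}
    # Python 2 style: copy the keys into lists (or sets) just to compare them
    same_old = sorted(d1.keys()) == sorted(d2.keys())
    # Python 3000 style as described: the views compare like sets, without copying
    same_new = d1.keys() == d2.keys()     # True -- same set of keys
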
We're doing the same thing for items.
Items also returns a set view.
Values, of course, could not quite return a set because you
can't have duplicates.
And you'd like those duplicates to show up when you
iterate over it.
We continue to maintain the invariant that if you iterate
in parallel over the keys and the
values, that you get matching keys and values at the same
position in the sequence, as long as you don't, of course,
modify the dictionary while you're iterating over it.
This, of course, has all been borrowed from the Java
Collections Framework.
I'm not afraid to borrow stuff from other languages.
I never have been.
I don't think I would've gotten anywhere if I tried to
invent everything myself.
So the important part of the keys--
the dictionary views in general, I expect that keys
and items are going to be the most important ones and values
are going to be only rarely used in practice.
Mostly probably in unit tests-- that's where I found
most of the uses.
These view objects are very lightweight, because they're
basically a structure containing one pointer which
points to the original dictionary.
So iterating over a keys view, or iterating over any
of those three views, was actually trivially
implemented.
Because even though I removed the iterkeys, iteritems, and
itervalues methods, I didn't remove their implementations.
And their implementations are still useful as the iterators
over the view objects.
So because I actually did some of the work on this over the
weekend, there are two unimplemented parts of it.
One is, as I mentioned, the set
semantics are not complete.
You cannot check whether your keys object is a subset of
some other keys object or another set.
You can only check-- compare them for equality.
The other thing is that we currently have about 15 or 20
failing unit tests still.
I expect that most of those unit tests are failing for
very trivial reasons.
I mean, what a lot of code does is it assumes that keys
returns a list.
And then it compares that-- the unit test, especially,
often do things like they create a little dictionary,
they mess around with it a little bit, and then they test
that the list--
the keys after sorting have a certain value.
And they usually just compare the keys object with the list
of constants.
That doesn't work anymore.
You could fix that in two ways.
You can explicitly cast the view to a list object.
That sort of fixes it solidly.
You can also replace the list constant that you compare it
with by a set constant.
Which I haven't mentioned yet, but which is one
of the later slides.
We have set literals now.
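As a sketch of the two fixes just mentioned (the dictionary and the expected values are invented for illustration):

    d = {"a": 1, "b": 2}
    # an old unit-test style assertion, assuming keys() returns a list:
    #     assert d.keys() == ["a", "b"]       # no longer true for a view
    assert sorted(d.keys()) == ["a", "b"]     # fix 1: turn the view into a (sorted) list
    assert d.keys() == {"a", "b"}             # fix 2: compare against a set literal
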
So much for dictionary views.
The default comparison, I already mentioned that in the
highlights.
Equality and not equal, of course, compare whether--
I mean, you have a default comparison and you can
overload your comparison.
You can implement your own comparison any
way you want it.
I'm not touching any of that.
But the default comparison that you get when you derive
from object and you don't overload any comparison
operators is changing quite a bit.
In Python 2, even in Python 1, even in Python 0, I think, if
you compare two objects of different types with an
ordering relationship, we just compare the address of
each object and say the one with the lower address comes
before the one with the higher address.
That turns out to be mostly a useless comparison.
It can give you sort of a false sense of security that
if you sort or compare something and you don't know
what the types of the objects are, it's not going to throw a
type error.
But actually you want to throw a type error.
Because most of the time, if there are objects of different
types that aren't really comparable, that haven't
explicitly programmed how they should be compared with each
other, the default comparison is just
giving you random results.
And maybe in one run, this object always shows up before
that object.
But another run, because you have slightly different input
data, their allocation on the heap is different and the
object that was smaller first is now suddenly larger.
And you can have all sorts of bizarre situations where you
have flaky unit tests.
So in particular, this means that you can no longer
compare, or sort, integers and strings, just like you can't
concatenate them or do anything else with them before
converting.
That's pretty much it.
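A minimal sketch of the change (the list is invented for illustration):

    mixed = [3, "two", 1]
    # Python 2: sorted(mixed) "works", ordering the mixed types by arbitrary rules
    # Python 3000 as described: sorted(mixed) raises TypeError, because int and
    # str don't define an ordering with respect to each other
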
In practice I have not found that this affects much code.
I mean, I have found very little code in the standard
library that actually relied on this default comparison
existing, except again in unit tests that were specifically
checking this behavior.
It always feels good to rip out code.
Then we get to the scary thing.
And it's scary for me because I haven't started
implementing it yet.
I think it's also scary for application developers because
it can potentially affect application performance,
application semantics.
It's going to be one of the bigger things for converting
code to Python 3000.
If you're not using unicode today in your application,
you're probably pretty safe.
If you're using unicode today, sort of everything you know
about keeping track of encodings, and which strings
are unicode which strings are not unicode, will probably
have to be changed somewhat.
So again, we're borrowing heavily from Java.
There's going to be one string type named str.
But again, its implementation will most likely be that of
the 2.x unicode implementation.
And we'll have a separate bytes type which
is new, brand new.
Although its implementation most closely resembles the
array module that has been around probably since
Python 1.5 or so.
You can only ever go between these using an encoding.
If you compare them or concatenate them--
if you compare a bytes object to a string object, it will
just throw a type error.
This is yet another place where the change due to
default comparison is actually helpful, because it just
points out that you're doing
nonsensical operations quicker.
What will completely disappear, and this is
actually a big improvement and the main motivation, is the
endless problems you have in current Python applications
that use a mix of 8-bit and unicode strings.
And occasionally, encoded unicode ends up in an 8-bit
string, so you have characters with a high bit set, and then
suddenly they will not interoperate happily with
actual unicode strings.
The thing is if you have an 8-bit string that only
contains ASCII characters, you can concatenate it or compare
it to a unicode string just fine.
And it will have sort of the proper semantics.
But if you have an 8-bit string that actually uses bit
number 8 of at least one of the characters in that string,
you suddenly cannot compare it or concatenate it
to a unicode object.
And unfortunately, this often happens after your application
has been deployed, especially web applications.
The developers live in the US.
They do a lot of testing, they type in their name, and
there's never an accented character around.
Then their first French customer enters their login
name and everything blows up.
Painful.
So we hope that by forcing you to sort of do all the
conversion between bytes and unicode at a much more
specified point slightly earlier in the life of the
strings, you won't--
I mean, you basically--
if you make a mistake, and you do not explicitly convert your
bytes to unicode, typing a name without accented
characters will also not work.
So you're much more likely to actually have effectively
tested your application for all use cases.
This has caused a lot of discussion, and I think that's
still an understatement.
There are lots of different implementation choices.
My personal choice would be, we'll go with basically the
unicode data type that we currently have in Python 2.--
well, since Python 2.0 it hasn't changed a whole lot.
It uses an internal representation that is either
two bytes per character or four bytes per character.
When it's two bytes per character, technically it's
UTF-16 because you can have surrogates in there, if you
care about that.
But the surrogates for most practical purposes look like
characters through the application unless you really
go to dive deep into unicode.
That is one possible implementation.
Another possible implementation would be to
keep a similar thing, but actually have three internal
representations.
One that is a single byte wide, one that is
two bytes wide, and one that's four bytes wide.
This means it's less easy to use some of the C Standard
Library that might exist, or extensions of the standard
library that might exist for working with unicode
characters on a particular platform.
On the other hand, it means that you would never have to
worry about surrogates, because the surrogates would
always be converted into 4-byte characters.
It means that if you have a string that contains 1
character that doesn't fit in 2 bytes, the entire string is
4 bytes per character.
That's a compromise currently.
You can compile Python in such a way that all unicode
characters are 4 bytes wide.
It's sort of a cultural choice whether it's worth
having the characters be wider and not having to worry about
surrogates.
The C API issues frankly are a mess.
I'm not going to spend much time describing that here.
Generally, my approach to Python 3000 is first I want to
get sort of the Python programmers APIs cleaned up.
And while it's too bad if extension writers will
temporarily have to deal with a sort of slightly messy set
of APIs, I mean, in C code you're used to
things being messy.
There is a different faction in the Python developer
community, or at least in the people who are quite vocal in
the Python 3000 list, which is not necessarily the same, who
would like to see things like--
well, the most extreme view is to actually support
variable-length encodings as the internal representation.
For example, if you have a large file containing unicode
data, you might want to read that into something that calls
itself string object but actually
still contains unicode--
UTF-8 bytes internally.
The problem with that is now I have 10 megabytes of UTF-8,
and I have a program that sort of tries to walk through that
code from the end or just randomly
accesses byte 7 million.
There's no way to find out where-- sorry, character 7
million-- there's no way to find out where character 7
million is without parsing the first 7 million characters.
You could try to optimize that.
Keep a cache of a couple pointers, but it gets messier
and messier and more and more complicated.
I'm not sure that that's at all a viable idea.
Maybe someone can prove me wrong by actually coming up
with an implementation, but I'm skeptical.
A slightly less ambitious, but still very controversial idea
is to optimize things like slicing operations and
potentially also
concatenations, so that if you--
for example, if you have a slice-- you have a string of
10 megabytes, and you take a slice of four megabytes out of
that string, currently Python always copies.
You could say, well, let's just share that array that
already contains those bytes.
I mean, after all, they're immutable
objects, they can't change.
Once they've been read into memory, they are there.
The object's not going to move.
Unfortunately, most of the implementations of that idea
are very easily lured into a worst-case behavior, where you
do something like you read repeatedly--
you read a megabyte string in and you slice 30 bytes out of
it or something.
And so now you have a 30-byte object--
a 30-character string object that references a slice of a
megabyte-long string object.
And you can't deallocate that megabyte until
you deallocate 30-byte--
the 30-character string.
And you can try to work around that with heuristics, like if
the slice is really small, you copy anyway.
Or if it's small relative to the size of the original, or
you can try to use weak references to sort of
dynamically copy.
And the only effect of that is that you have more and more
code that could go wrong, and less and less actual
performance benefit.
So I think in the end, the approach of very
straightforward, simple algorithms that always copy is
still going to be a winner.
But I'm trying to keep an open mind about this.
So the bytes type--
the best way to think of it is a mutable
sequence of small integers.
So it behaves a little bit like a list, but the values
you can store into it are limited to being integers.
They have to be positive and they have to fit in a byte.
It also behaves a little bit like a string.
There's a bunch of string methods that make total sense
for byte arrays like find.
On the other hand, certain string methods that are locale
dependent or character encoding dependent will
definitely not be allowed.
Like you will not be able to lowercase or
uppercase a byte string--
a byte array.
To go from a byte array to a string,
you use the decode method.
To go from a string back to a byte array,
you use the encode method.
And those always require an encoding parameter.
If you want some kind of default encoding, you're going
to have to dig it out of the environment yourself.
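As a minimal sketch of that rule (the utf-8 name and the b"..." literal spelling are assumptions about the eventual Python 3 notation):

    data = b"caf\xc3\xa9"            # a bytes object holding UTF-8 encoded text
    text = data.decode("utf-8")      # bytes -> str; the encoding is explicit
    back = text.encode("utf-8")      # str -> bytes; again explicit
    # comparing or concatenating data and text directly raises a TypeError
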
Bytes type has actually been implemented.
Some of the string behavior probably still needs to be
added, but in general I'm pretty happy with it.
You can actually already use it for I/O in limited
situations.
So that's a nice segue to the new I/O library, which is yet
another idea inspired by Java.
And you could also say it's inspired a little bit by Perl,
which also has stackable components in
its newer I/O library.
So at the very low level, you can read bytes from--
well, from a file descriptor, a file handle.
On Unix it's going to be a tiny object that wraps a Unix
file descriptor.
On Windows it's going to be a tiny object that wraps a
Windows file handle.
It provides read, write, close, seek and tell methods.
There is no buffering going on and it always
talks in terms of bytes.
It doesn't do any carriage return line
feed conversion either.
If you start on a brand new platform that is not at all
like Unix or Linux or Windows or Mac, you're going to have
to provide your own low level byte I/O implementation.
Most likely there's actually a Unix emulation library that
you could probably use, as long as you can turn off any
character translation features it might have.
I mean, that's a possibility for Windows, too, but on
Windows there are actually slightly lower level things
that are more efficient and more flexible.
But that's the only thing you have to do for a platform.
I mean, buffering, unicode, encoding, decoding, carriage
return, line feed translation--
all those things can then be built on top of that without
any platform specific stuff.
Using this, I expect that in most applications, unless you
are doing very messy stuff where you're sort of not sure
whether you're reading binary data or text, which of course
happens, you will not have to change your program.
The open function will continue to
return the file object.
You can tell it to open a binary file or a text file for
reading or for writing.
All those things will still work.
However, if you open a text file, read and
write will use strings.
If you open it as binary, read and write
will use byte arrays.
So that's probably--
if you're doing binary I/O, you're more likely to have to
change your code than if you're doing text I/O.
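A sketch of the text/binary distinction being described; the file names are placeholders, and the mode strings plus the encoding keyword (discussed just below) are assumptions about the eventual spelling:

    f = open("notes.txt", "r")                      # text mode: read() returns a str
    g = open("image.png", "rb")                     # binary mode: read() returns bytes
    h = open("notes.txt", "r", encoding="utf-8")    # explicit encoding, discussed just below
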
Now how does it decide on the encoding when you're doing
text I/O and you don't specify the encoding as
an extra open parameter?
Open will have a keyword parameter that will let you
specify an encoding, but if you don't specify that, it's
going to pick a default.
And I can imagine a number of different ways
of picking a default.
You could say, well we'll pick ASCII, or we'll pick UTF-8, or
we'll sniff the file and actually see whether it looks
like UTF-8 or UTF-16, little-endian
or big-endian encoding.
You may try to see what the user's environment says about
file formats.
There are a couple of different ways.
I mean, if you're dealing with a TTY device in the Windows
environment, I think the TTY device actually knows what
encoding to use.
So that would be another way to get
your encoding by default.
I expect that when you're opening a file for binary I/O,
you will not be able to use the
readlines or readline methods.
Unless it turns out that a lot of code breaks--
I mean, I don't actually honestly know if there is much
code around that has a legitimate reason for calling
readline on binary files, but there might be.
So we'll see.
An interesting thing is also how you're going to tie these
things to sockets.
But I think all the socket has to do is provide a little
wrapper that implements the same read/write operations
that the lowest level binary I/O object does.
And you have to somehow decide on what your encoding is.
[INAUDIBLE]
or ASCII or something else.
And then you will be able to read and write from sockets.
By the way, we're completely weaning ourselves off the C
Standard I/O Library for a number of reasons, mostly
having to do with the C Standard I/O Library not
actually always providing the functionality that we need.
Like it provides buffering, but it doesn't provide an API
to see how many bytes have been buffered, if there's
anything buffered.
It doesn't have a way of peeking in the buffer.
We need those things.
There's also this thing that the C Standard I/O Library
says that basically you could expect a segfault
or World War III when you read and then suddenly
start writing to the same file descriptor, even if the thing
was opened for reading and writing.
You still have to seek when you switch
from reading to writing.
Since the Standard I/O Library-- the C Standard I/O
Library doesn't promise that you get a neat error message
when you forget to seek in between, that's a really
unpleasant thing for Python to have.
So Python has to keep track of, are you
reading or are you writing.
So we end up sort of redoing too much of the C Standard I/O
Library's functionality anyway.
So we'll just throw it out and hopefully have a bigger and
better implementation.
So int/long unification is a really simple thing.
Currently, Python has small ints named int, and large ints
named long.
The large ints are actually arbitrary precision, so you
can represent numbers as long as they fit in memory.
The small integers are actually mapped to C long, so
they are 32 or 64 bits depending on what kind of
platform you have.
That was really a mistake, and I made that mistake sort of
very, very, very early on in Python's design.
And over the years we've made more and more compromises
where you can use int and it will actually behave as if it
were long if it doesn't fit.
Like in older versions of Python, if you kept
multiplying numbers together and the results got bigger and
bigger, at some point you'd get an overflow error.
In modern Pythons--
I think it started in Python 2.3 or so--
certainly in Python 2.4--
when the result doesn't fit in 32 bits or in 64 bits in some
platforms, you'll just get a long integer.
And more and more places, if it doesn't fit in a small
integer, we'll just give you a long integer even to the point
where if you call the int function, and somehow the int
function can do a couple of things.
It can convert a float, or a string, or another
integer to an int.
In most of those cases if the result is a valid integer, but
it doesn't fit in 32 bits, so it's a valid mathematical
integer, nowadays int will just return a long object.
And so the only place where you're still aware of the
difference between ints and longs is if you're explicitly
checking the type of your objects.
If you say, if isinstance(x, int), then do this,
otherwise do that, then your code won't work when someone
passes you a long, even if it's a long containing a very
small value.
So the long thing becomes less and less useful.
And in Python 3000, we're just throwing the type out.
We looked at a number of different implementations.
What we chose was actually taking the long implementation
and renaming it to int, at least at the Python level.
In the C level, the distinction between long and
int is still very much visible.
We did have to optimize it a little bit, because the int
implementation was traditionally very optimized,
like it has a cache of small integers and a couple of other
allocation tricks.
The long type was completely unoptimized.
I think the new long int type is somewhat optimized.
At least it has a cache for small values.
We're probably going to try to get that performance back up
to speed comparable to the best performance in Python 2.x
during the year after the 3.0 alpha 1 release.
I don't know how close we'll get, but I'm hopeful that some
smart people will be able to do magic there.
And it makes life for the programmer much easier because
you know you can actually write isinstance(x, int),
and it will do the right thing.
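A minimal illustration of the difference:

    n = 10 ** 100        # far too big for a machine word
    isinstance(n, int)   # Python 2: False, because n is a long; Python 3000: True
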
Unfortunately, I have no idea what time it is.
I'm worried that I might actually--
15 minutes?
Oh, excellent.
AUDIENCE: You can usually run over, and nobody cares.
*** VAN ROSSUM: Except the tape runs out.
AUDIENCE: [INAUDIBLE]
*** VAN ROSSUM: Doesn't matter.
I'll try to be done in 15 minutes.
I think that's OK.
So we have integer division.
And again, that was a very early mistake where I sort of
mindlessly borrowed behavior from C. If you divide 3 by 4,
it gives you 0.
It turns out that certain algorithms really sort of find
that a *** trap waiting to explode when you
least expect it.
So we're going to make 3 divided by 4 return 3/4 in
some kind of float representation.
And you can use double slash if you really wanted that 0.
Now that double slash operation has been in Python
2.x probably since 2.1 again--
2.2, OK, I believe you.
So you've had plenty of warning and there's also an
option you can pass to Python 2.x that will tell you when
you're using the single slash operator, and it is used on
integer operands.
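A short sketch of the two operators; the future import is the existing Python 2.x opt-in:

    print(3 / 4)     # Python 2: 0 (truncating); Python 3000: 0.75
    print(3 // 4)    # 0 in both -- explicit floor division
    # In Python 2.x you can already opt in to the new behavior with:
    #     from __future__ import division
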
So changes to exceptions.
We're getting rid of string exceptions.
We're also enforcing that all exceptions derive from a
single root exception type which is called BaseException.
In practice, you should derive all your exceptions from
Exception, which is slightly lower in the
hierarchy than BaseException.
But you can, if you know what you're doing, derive from
BaseException.
Also, we're going to move the traceback into
the exception object.
Again, I should mention, this is an area where
Java has been leading.
We're cleaning up the raise statement.
There are two different ways of raising an
exception with arguments.
You can say raise E parentheses arguments close
parentheses, or you can say raise E comma arguments, or
arguments in parentheses even.
But the second syntax was only necessary back in the day of
string exceptions, so we're getting rid of that.
If you want to pass a traceback, you call a
method on the exception object that you already created that
sort of adds a traceback object.
We're also changing the except clause.
When you're catching exceptions there's a pretty common
mistake where you wanted to catch two exceptions but you
forgot to put parentheses around them, and now you're
catching the first exception, and when you catch one, a
local variable is created with the name of the second exception.
In order to prevent that, instead of a comma between the
exception and the variable, we're going to
use the keyword as.
Also new is-- and this has to do with the exceptions now
sort of containing the traceback as an attribute.
We're going to delete that variable, if it still exists
at least, at the end of the except block.
We're basically going to put a try/finally in that
block that you won't see but will be there.
That deletes the value if it exists.
Which means that if you want that value--
if you want that exception value to survive beyond the
except block, you have to just assign it to a
different local variable.
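A minimal sketch of the new except syntax and the implicit cleanup just described (risky is a hypothetical function, defined here only so the snippet runs):

    def risky():
        raise ValueError("boom")          # placeholder, for illustration

    try:
        risky()
    except (ValueError, OSError) as e:    # "as" replaces the old comma
        saved = e                         # bind to another name if you need it later;
                                          # e itself goes away when the except block ends
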
So we're not going to do optional type checking, but we
are going to add some syntax that will allow other people
to implement frameworks that do something like type
checking, or whatever they would like to do.
Basically, currently every parameter of a function has a
default value.
Well, it can have a default value.
We can now also associate an
annotation with every parameter.
The annotation is introduced by a colon; the default value,
of course, is introduced by an equals sign.
You can combine those: the colon, annotation, equals
sign, expression notation.
You can also annotate the function returned
value with an arrow.
All those things are evaluated when the function is defined.
So at the same time the function object is created,
both the default and the annotation, which are just
generic expressions, I have no constraints on
that, but they must--
if they reference variables, those variables must exist at
that point in time.
And then you can pull those annotations out of the
function object by asking for the func_annotations attribute
of the function.
And that's just the dictionary indexed with variable names
and the keyword return.
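A sketch of that notation; the annotation values are arbitrary expressions chosen for illustration, and func_annotations follows the attribute name used in the talk:

    def parse(source: "text to parse", strict: bool = True) -> "a parse tree":
        ...
    # parse.func_annotations would then be roughly:
    # {'source': 'text to parse', 'strict': <class 'bool'>, 'return': 'a parse tree'}
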
If you want to do something with this, you'll have to do
it yourself.
I can imagine all sorts of decorators or metaclasses that
make good use of this to enforce all sorts of things
from actual type checking to automatic adaptation and a
number of other interesting things.
I'm not going to put anything like that in the language, at
least not in 3.0.
Another small change to function signatures,
completely independent from the previous one-- both of
these have been implemented by the way.
Sometimes it's really helpful to have a parameter that is
required to be used as a keyword in your call syntax.
If you really want to enforce that in Python 2, you can use
star star keywords and sort of pull it out of the star star
keywords dictionary.
But it's kind of messy, and you have to sort of check for
each of the keywords that you might expect and check that
there isn't anything else in there in order to be sort of
robust and user friendly.
Now you can just use this strange notation where there's
a star without--
I mean, the star, of course, normally means star--
you can use it already as star args, which means we have a
variable number of positional arguments here that gets
returned as a tuple.
Now if you leave the name out from that syntax, you just
have a star without the star args.
And then you cannot specify arbitrary positional
arguments, but you can, after that, specify more arguments
that will then be required to be keywords.
And they don't even have to have defaults.
So after that star, you could have c is 42, so that's an
optional keyword parameter, but d doesn't have a default
value, so that's a required keyword parameter.
So every call to foo in that case must specify a value for
d, and it must specify it using the keyword notation.
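A sketch of that signature, using the names from the example just described:

    def foo(a, b, *, c=42, d):
        return (a, b, c, d)

    foo(1, 2, d=5)        # fine: d given as a keyword, c takes its default
    # foo(1, 2, 5)        # would raise TypeError: d can only be passed as a keyword
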
Set literals, very simple.
You put a number of expressions in curly braces
and it creates a set object.
Except if there is nothing between the curly braces, it
still creates a dictionary.
At some point I tried to propose to unify the
dictionary and the set object.
That didn't get a lot of support from
the developer community.
If you really want frozen sets, it turns out frozen sets
are only very, very rarely used.
You'll have to cast that thing explicitly to a frozen set.
Or, of course, you can use a frozen
set with a list argument.
We're also going to implement set comprehensions.
Those are not yet in the code base.
It works the same way as a list comprehension except it
returns a set.
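A small sketch of the notation described here:

    s = {1, 2, 3}                         # a set literal
    d = {}                                # still an empty dict, as noted above
    empty = set()                         # how you spell an empty set
    evens = {x * 2 for x in range(5)}     # a set comprehension (planned, not yet implemented)
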
Absolute import--
you can already do that in Python 2.5.
from __future__ import absolute_import--
that means that if you import a module using import foo or
something like that, inside the package, normally in
Python 2.4 and before, it first sees--
tries to find that foo in the package.
If it's not in the package, it looks in sys.path.
In 3.0 or in 2.5, if you have that future statement in your
module, it's not going to look in the package.
That solves a particular ambiguity where you might have
a module in your package that has the same name as a module
in the standard library--
the top level module in the standard library.
Currently, without this future import, there's no way to
reach out and actually import the standard library module
because the one in your current package will always be
seen first.
Well, you could dig it out of sys.modules, but only if
it's already been imported by someone else.
If you want to say, I definitely want the foo that's
in my package, rather than potentially the one on
sys.path, you can say from dot import foo.
That's also already in 2.5.
The only difference, really, is that in 3.0, you always
have that future statement automatically
implied in your code.
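As a sketch, inside a package module (foo is a placeholder module name):

    from __future__ import absolute_import   # the Python 2.5 opt-in; always implied in 3.0
    import foo            # absolute: the top-level foo, not this package's foo
    from . import foo     # explicit relative: this package's own foo
    # (in real code you would pick one form or the other, not both)
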
Exec--
in very early Python versions this actually was a function.
It takes an object which is either a string or a code
object or a file, and then optionally globals and locals.
At some point I thought that the compiler could make good
use of the fact that you were using exec somewhere in a
function and I decided that in order for the compiler to know
about it, it would have to be a statement.
Well, compiler technology has advanced a little bit.
And you can actually tell fairly reliably whether you're
using a function like this.
So there's no need for it to be a statement, and it's
actually easier to have it as a statement.
Sorry, as a function.
So it's back to being a function.
The interesting thing is this is very easy to do in Python
2.x also, because since it once was a function, that same
syntax with a tuple of up to three values is also still
supported in 2.x.
So range--
just like we have keys and iterkeys, we
have range and xrange.
Because range was there first, range creates a list of
integers, potentially many.
Xrange produces only the integers that you ask for.
So we're going to change that so that there's only going to
be a function named range, but it will
behave mostly like xrange.
The difference is the current xrange is optimized so that it
actually only works for integers that are less than
sys.maxint.
And Neal Norwitz has a patch to fix that, but I'm still
waiting for him to upload the patch or something.
Zip--
this is actually a pretty minor issue.
Zip is something that would be a very good candidate for
returning an iterator in Python 2 when it was-- except
it was introduced before iterators existed.
So there's that izip thing in itertools that does return an
iterator; it makes much more sense for zip to be an
iterator in the language.
So string formatting has a couple of problems, and there
is a PEP which I hope will be implemented.
I'm certainly in favor of the proposal to give strings a dot
format method and to use curly braces instead of percent
something as the indicator for replacement
in the format string.
Here, quickly, are a couple of examples.
You can specify format arguments by positions, 0 and
1, or by name, foo.
If you want to include little curly braces
you can double them.
You can even access attributes or use get item dictionary
notation in simple cases on the formatting object.
You can also specify parameters after a colon.
I think that is actually borrowed from .NET.
Although I'm not sure that we are taking
exactly the same notation.
Read the PEP if you're interested.
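A few examples in the spirit of that proposal (the exact format-spec details are in the PEP; the values here are invented for illustration):

    "{0} is {1}".format("Python", 3000)                       # by position
    "{foo} costs {price:.2f}".format(foo="tea", price=2.5)    # by name, with a format spec
    "{{}} gives literal braces".format()                      # doubled braces
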
So this is something that's actually probably not going to
make it, but I'm mentioning it anyway, because it is
potentially an interesting feature.
It's just there are a couple of difficult
decisions to be made.
I mean, it's very easy to come up with a decent
switch style syntax.
You can say switch expression, case
expressions, blah blah blah.
The question is, when do you evaluate the case expressions.
In order to actually benefit from a potential speedup, like
you could do a dispatch based on a dictionary, you would
like to precompile those case expressions.
For example, compile them at the time the function is
defined rather than each time the function is invoked.
But that limits you to actually constants.
And that's not a concept we currently have anywhere else
in the language, which makes it somewhat problematic sort
of conceptually.
Which is why we haven't implemented it yet and it's
marked with both stars and question marks.
Another thing that is more likely to make it, even though
it's slightly ugly, if you have a function, an inner
function that references a variable defined in an outer
function, you can use it, but currently you
cannot assign to it.
You can modify it if it's a mutable object, like if you
have a list object checked in the outer function, you can
append to that list or even index it and change an element
of that list.
But you cannot replace it with a different list object using
plain assignment.
Turns out that there are enough places where people
would like to have that functionality.
And we had a long discussion where Ka-Ping Yee did a
brilliant job of summarizing the discussion and sort of
guiding it towards perhaps not final completion, but at least
closure so that everybody could agree with what was
written down in the PEP.
We're pretty much settled on the
syntax and on the semantics.
The only thing is there are different flavors of keyword
that sort of each have their own advantage and
disadvantage.
Nonlocal is the current favorite.
It's sort of ugly because it's a long word and it has sort of
a negative meaning.
Unfortunately, the only real contenders
were global and outer.
Where the problem of global is that, global for most people's
minds, has fairly set semantics, which really
doesn't mean just go search outward scope by scope by
scope, but really go all the way to the outermost scope,
the global scope.
So that's why--
even though global was my favorite, nobody else seemed
to like it very much.
And I have to respect my users.
Outer was a nice candidate until we found how often that
word is already used as a variable name, or function
name, and that made it much less attractive.
So it's probably going to be nonlocal, which is not
something people tend to use a lot as variable names.
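A sketch of what the new keyword would allow, assuming the nonlocal spelling wins:

    def make_counter():
        n = 0
        def increment():
            nonlocal n      # without this, "n = n + 1" would just create a new local n
            n = n + 1
            return n
        return increment
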
So another very speculative thing is
abstract base classes.
And we had long discussions about
interfaces, generic functions.
Abstract base classes actually, the more I think
about it, the more attractive they look from the perspective
of a somewhat voluntary declaration of, I implement a
particular protocol.
But a protocol is a very informal concept.
We've had concepts like protocols in
Python for a long time.
We've been talking about sequences and mappings as sort
of implementing certain operations and not others.
The problem is if you have an actual object and you don't
know whether it's a sequence or a mapping, there's not
really a good way to decide which one it is.
You can check whether it has a keys method, but there are
actually some cases where you have something that really
behaves like a mapping, but it maps an infinite number of
keys, and you really don't want to implement a keys
methods that tries to enumerate all of them.
So if there was an abstract base type that didn't provide
any semantics or implementation, but just
serves as a marker class, I am implementing the sequence
protocol or I am implementing the mapping protocol or I am
implementing the file protocol.
And it's probably going to be a couple of--
there is going to be more fine-grain distinctions, like
you have readable files and writeable files, and readable
and writeable files, and you probably have mutable
sequences and immutable sequences, and very basic
mappings that only implement the map operation and sort of
very complete mappings that implement lots of other
functionality like update and keys.
But if I get time between now and April, I'll write a PEP
about this and then implementing it
is going to be simple.
But this is something that just adds some stuff.
It's going to be easy to make all the standard types declare
what stuff they implement.
And then it's just up to user code to
voluntarily follow this.
I mean, we won't stop you from implementing sequence protocol
methods without declaring that you're a sequence.
But the carrot in this case is that if you want to interface
with a large framework like Zope or Twisted or something
like that, it might be that eventually future versions of
those frameworks that work under Python 3.x will
actually, instead of sniffing which methods are implemented,
will actually just look at the base classes.
That's the hope, anyway.
So I'm going to skip the miscellaneous changes.
You can get the slides from the web eventually.
This is mostly cleanup of very small stuff.
Library reform is not my own idea of fun.
I like to focus on the language.
Language is big enough that--
other people are interested in reforming the library.
There's currently not a lot of activity going on.
It's certainly something that I think is a fine project to
do after we've released the alpha 1
release of the language.
So again the C API--
I'm currently not too worried.
I'm just randomly changing the C API as object types change.
Of course, if you're writing a third party extension that's
not already part of the Python source tree, you would like to
know what's going to happen.
At this point, the only thing I can promise is I'm not going
to change functions to have a different signature
but the same name.
Or different semantics even with the same signature.
I'm going to add APIs, I'm going to delete APIs that are
no longer relevant or impossible to implement.
I'm not going to change APIs in an incompatible way that
would break your code.
I am going to require everyone to recompile their code.
That's the minimum I can expect.
So if your compilation passes, you're somewhat likely to
actually have a working extension.
Best case scenario.
If you're using APIs that no longer exist, you'll get a
clear compile time error about something that doesn't exist
or maybe a link time error.
So now you have a bunch of Python 2.x code and you want
to turn it into Python 3.0 code.
Well, you could just try to run it with 3.0 and fix all
the syntax errors and then fix all the runtime errors.
Hopefully you have unit tests.
That's going to be pretty tedious because there--
even though the general flavor of the language doesn't change
much, there are clearly a lot of small changes
that really add up.
Classic classes, except-as, different raise syntax, no
comparisons, keys, dictionary views are going to affect a lot
of people; print statements, of course, are going to affect a
lot of people.
Unicode is going to be a major deal for at least some people,
so there is a conversion tool.
Now, we cannot do a perfect conversion because in some
cases it's inevitable that you have to do a symbolic
execution of the application in order to find out what the
types of a particular variable are before you know how to
convert a particular call.
I mean, if I say x dot keys, there's no guarantee that x is
actually a built-in dictionary.
It could be a completely unrelated object that has a
keys method.
However, there's a good chance that it is a dictionary.
If you have something that has an iterkeys method, there is
an even bigger chance that it's a dictionary.
So what we're doing is we have a tool that parses your code
and looks purely at the parse tree and is able to transform
that parse tree in place and then write it back out.
And we annotate this parse tree with exactly where the
white space is and where your comments are.
So in theory-- certainly, I know because I have tested it--
if you don't make any transformations, the output is always
exactly the same as the input.
Every single white space character.
That conversion is perfect.
Now if you make transformations, sometimes
it's possible that you would lose a comment if that comment
sort of is in the middle of an expression that gets
completely discombobulated and transformed into something
completely different.
That's not very likely to happen, because how often do
you have significant comments between the parameters of a
function or right after a binary operator?
Not so common.
So if you're interested in looking at this code,
currently you have to go to svn.python.org, find the
sandbox, and go to the 2to3 subdirectory.
It's relatively easy to add new conversions.
I mean, I've had a couple of Python developers who started
contributing conversions actually.
That's been really great.
The idea is you write a pattern that decides, I want
to match certain nodes in the parse tree that look like--
that match the pattern.
And the pattern completely ignores what the comments say.
It purely looks at what the parser actually sees.
So there are really two parts to the parse tree.
There's the annotation for white space and comments, and
there is the syntactic tokenization and parse.
So the matching is purely concerned with matching nodes
and leaves in the tree.
And I'll show the pattern--
syntax in a minute.
So you write your pattern, and then you write a
transformation function that sort of picks the node you
find apart and puts it back together in a different order
and returns that new node.
And some caveats--
then there's a framework that does all the rest of the work,
like traversing the entire tree looking for all the nodes
that match the pattern and calling your transformation on
each of those.
Sort of a separate strategy that is also going to help is
Python 2.6--
by default it will just be Python 2.6, but it will have
an option where it will warn about things that will go out
of style in Python 3000.
It will probably also backport certain Python 3000 features
so you can start using those.
I don't want to give examples because not much of that has
actually been implemented.
Maybe Thomas can talk about that next week.
So here are a couple of things that the transformer
is really good at.
It can take a call to apply and turn it into the more
modern notation using star args and star star keywords.
And as long as you don't have a local variable named
"apply," this is going to do the right thing.
And it will put extra parentheses around the
function or the arguments, if necessary, to make sure that
it doesn't sort of get affected by nearby operators.
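For example, as a sketch of that transformation (func, args, and kwargs are placeholder names):

    # Python 2 input:
    #     result = apply(func, args, kwargs)
    # converted output, as described:
    #     result = func(*args, **kwargs)
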
Slightly less perfect but still pretty close, it turns
everything that says iterkeys into keys and
iteritems into items.
It can also do a really good job with exec.
It can do a really good job with print.
It can do a really good job with except clauses.
It also recognizes has_key, assuming that you don't have,
again, a user object that happens to implement has_key.
I found one example in the standard library where the BSD
wrapper library actually has a two-argument has_key where the
second argument I think passes in transaction state.
So I'm not quite sure what to do with that.
So just don't convert that one.
But otherwise, turning d.has_key(k) into k in d,
again making sure to parenthesize subexpressions or
the whole thing as necessary based on the context.
So it doesn't add parentheses unless they are necessary to
disambiguate stuff.
On the other hand, if you have redundant parentheses in your
input, you will have the same redundant
parentheses in the output.
It's very simple to turn the less than, greater than
notation for unequal (<>) into the exclamation point
equals sign (!=).
It can turn backticks--
it can even turn long into int.
I've found that actually these things were not quite enough
to get most of the unit tests to pass.
The problem is that a very popular testing framework in
Python is called doctest.
And it works by having documentation strings so
they're just string literals due to parser containing
fragments of Python sessions--
interactive Python sessions that, in theory, you could
just cut and paste them out of your shell window into your
Python source code.
And then there's a framework that automatically tests--
it's sort of a regression framework that checks that
those examples still have the same output as they had when
you pasted them in.
Since all this stuff is inside string literals, it's not so
easy to see how we could convert those, because we
can't just go scan all the string literals and assume
that they contain Python code and turn everything that looks
like a print statement into a print function call.
However, what you can do is--
it turns out that, at least for the doctest stuff,
doctests are pretty recognizable, because they have to start
with a Python prompt, three greater than signs, and if
there are continuation lines they have to start with three
dots, and they all have to be sort of indented the same way.
So with very great reliability I parsed the doctests out of
the source file.
You have to actually run the tool a second time-- maybe
eventually I'll combine that-- currently you have to run the
tool a second time.
And it will just scan the source code
looking for doctests.
And this was a great relief.
I mean, at some point I was a little panicky because I
realized how much unit testing code I would
have to convert manually.
And then I realized I just have to do this.
The only place where it broke down tremendously was the
doctest for the doctest module itself, which applies this
trick recursively.
There, I just ran the test and sort of fixed the thing
manually until it worked.
Nothing else I could do.
Now there are also a whole bunch of things that this
conversion unfortunately cannot do.
If it sees d.iterkeys(), it has no way of knowing whether
d is actually a dictionary.
If it sees d.keys(), it has no way of knowing whether
you're going to expect that thing to be a list or not.
If it sees x / y, it has no way of knowing whether, when
you execute that code, x and y are integers or not.
So it's not able to turn that single slash into a
double slash.
It can't find code that somehow depends on being able
to order objects of different types.
It certainly doesn't clean up your code or remove redundant
definitions.
If you write your own code that emulates a dictionary and
reimplements the mapping protocol, it's not going to
touch that.
It's also not going to fix your string exceptions.
Basically, all it can do is match on a parse tree.
Stuff that you can reliably, or mostly reliably, fix by
looking at the parse tree alone is a good candidate for this tool.
I don't know if that's going to be enough.
Maybe at some point, we'll have to add understanding of
variable scope and things like that so it can actually tell
whether a particular occurrence of a variable named
"apply" is, in fact, the built-in in
function "apply" or not.
I'm currently hoping that we won't need to do that.
Otherwise we would probably have to somehow merge this
tool with PyChecker, which would be quite a refactoring.
So if you're interested--
I'm actually probably going to skip this--
this is what the matching notation looks like.
You basically use the names that are also used in the
grammar file.
Python has its own grammar file.
Here are a couple of examples.
power is a symbol in the grammar: a power is an atom
followed by zero or more trailers, and then optionally
followed by a double star and something called a factor.
And there are a couple of alternatives for what an atom
is, and a definition of what a trailer is.
And there are several hundred lines like this that
make up the entire Python syntax.
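For reference, the relevant lines of CPython's Grammar file read roughly like this-- quoted approximately, since the exact file varies a little between versions:

    power: atom trailer* ['**' factor]
    trailer: '(' [arglist] ')' | '[' subscriptlist ']' | '.' NAME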
So our conversion tool actually reads that file with
the Python syntax at the start of a run and builds a
parser customized to that syntax.
So it's actually very easy to change the syntax that the
conversion tool uses; you just have to edit one text file.
The trick I use in the patterns is that I use the same
notation as in the grammar, plus regular-expression-style notation.
So you can match here a pattern-- the pattern power-- and
the angle brackets specify that inside this node labeled
power, I must match the following thing.
So this is--
we want to match a power that starts with, well, one or more
nodes of any type, but they must be exactly at that level.
And then a node of type trailer with a particular
substructure, namely the trailer alternative that has a
dot followed by a name.
And the name, in this case, must be iteritems.
And then it can have more trailers.
That's an example of a matching rule that's close to
actually the rule I use for fixing iteritems.
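Written out, a pattern along those lines would look roughly like this-- my reconstruction of the slide, not necessarily character-for-character the shipped rule:

    power< any+ trailer< '.' 'iteritems' > trailer* >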
So if you have the expression a[0].iteritems(), the parser
sees that as an atom containing a, and then a node that's a
trailer-- that's the square brackets-- another node that's a
trailer-- the dot iteritems-- and another node that's a
trailer-- that's the parentheses.
And that happens to match this pattern as follows.
The first two together actually match the any plus.
Then follows the trailer which happens to match a trailer
with that particular substructure.
And then the final trailer matches the trailer star.
And you can nest these things as much as you want, and it's
relatively efficient in just traversing the tree and
finding matches.
What your transformation function gets is the node that
matched the top level of the pattern.
It also gets a dictionary containing elements-- sort of
subnodes of that node.
And what I didn't show-- what I'm not showing here is you
can add names to any particular
section of the pattern.
You can say, oh, this subpattern, call that foo.
Or call this other subpattern bar.
And then you can sort of pull all those subnodes that match
those subsections out, and you can rearrange those in a
different order.
That's, for example, how you do the apply thing.
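As a sketch of how named subpatterns and a transformation function fit together-- written against the fixer API as it later shipped in lib2to3, which may differ in detail from the tool as it existed at the time of this talk; FixIteritems here is a simplified stand-in, not the real fixer:

    from lib2to3 import fixer_base
    from lib2to3.fixer_util import Name

    class FixIteritems(fixer_base.BaseFix):
        # "method" names the sub-pattern, so transform() can pull the matched
        # leaf out of the results dictionary and replace just that piece.
        PATTERN = "power< any+ trailer< '.' method='iteritems' > trailer* >"

        def transform(self, node, results):
            method = results["method"]
            method.replace(Name("items", prefix=method.prefix))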
So here's the slide that you're all waiting for.
What can you do today?
Well, my first recommendation is don't worry about the
changes that the transformation tool can
actually take care of.
I mean, my first version of this slide actually started
out with, OK, so use star args instead of apply, and use
raise Exception(value) instead of raise Exception, value.
And then I realized, no.
You shouldn't have to worry about all the stuff that we
can transform purely syntactically.
I mean, it's unlikely that you'll be able to write code
that is both valid Python 2.6 source code and valid Python
3.0 source code.
So you're going to have to run the
transformation tool anyway.
What you can do is make things easier so that after you've
run the transformation, you actually end up
with working code.
Using Python 2.6 means that you can use Python 2.6's
warnings to find certain things that the transformation
tool cannot handle.
It's always a good idea to have unit tests so you can sort
of see if the semantics of your new code are still what
you expect them to be.
And then there's a couple of things that the transformation
tool does not handle.
Like if you extract the keys from a dictionary and then you
sort the resulting list, the transformation tool is not
smart enough to correlate that the variable you assigned on
line 1 is being sorted on line 27 or on line 2, even.
But you can, starting today, use the built-in sorted
function, which is available in Python 2.4 and up.
And then you have code that can be
easily transformed correctly.
And similarly, if you really have a good reason to want the
return value of keys as a list, call list and pass it
the result of iterkeys.
The iterkeys call will be transformed by the
transformation tool, so it will still be a list and it
will be just as efficient in 2.6 as in 3.0.
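Spelled out as Python 2.6 code-- the form you would feed to the tool; the dictionary d is just a throwaway example:

    d = {"b": 2, "a": 1}

    # Instead of:
    #     keys = d.keys()
    #     keys.sort()
    keys = sorted(d)               # available since Python 2.4, converts cleanly

    # If you really need a list of keys, say so explicitly:
    key_list = list(d.iterkeys())  # the tool rewrites iterkeys to keys,
                                   # so in 3.0 this reads list(d.keys())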
Another thing you can very easily do is make sure that
all your exceptions are actually using classes derived
from Exception.
You can also make all your classes that aren't
exceptions-- that don't have a base class-- derive from
object, so that they're new-style.
There are certain semantic differences between classic
classes and new-style classes.
By converting them to new-style classes now, you catch
those semantic differences while you're sort of thinking about it.
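A minimal sketch; the class names ConfigError and Config are made up for illustration:

    class ConfigError(Exception):   # an exception class derived from Exception
        pass

    class Config(object):           # explicit "object" base: a new-style class
        def __init__(self, path):
            self.path = path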
And then with print-- don't worry about the print syntax.
I recommend that you just use the print statement and rely
on the transformer to turn it into function calls when
the time comes.
But be aware of the two cases where the transformation tool
doesn't do the right thing-- I think I showed it on the
slide about print-- if you have a string ending in a
newline or a tab.
Another thing you can do now is make sure that your code
uses a double slash where you expect an integer division.
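For example:

    total, count = 7, 2

    ratio = total / count     # in 3.0 this is 3.5; in old 2.x code it was 3
    chunks = total // count   # // always means integer division, in 2.x and 3.0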
So now we have--
well, in theory, we have five minutes for questions, if
anybody has the energy.
Yes.
AUDIENCE: You rejected the idea of using UTF-8 or UTF-16
for strings, because that would make random access
impossible to implement in order 1.
I am curious why you think that order-1 random access to
individual characters is a necessary property of strings.
*** VAN ROSSUM: So the question is why do I not want
strings to use an internal UTF-8 or UTF-16 representation,
and why do I think that order-1 indexing
of strings is important.
I think because it's a tradition in Python, unlike
some other languages, that we actually write a lot of code
that sort of traverses a string and keeps track of a
particular index.
There's just lots of code that indexes the string.
I mean, it's very common to say: if s.endswith('.py'),
return s[0:len(s) - 3].
That's all I can say.
Common idioms in Python code use slicing, which uses
numerical indices quite a bit.
And pattern matching is used much less.
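Spelled out-- strip_py_suffix is just a made-up name for the idiom being described:

    def strip_py_suffix(s):
        if s.endswith(".py"):
            return s[:len(s) - 3]   # slicing with a numeric index
        return s

    print(strip_py_suffix("example.py"))   # prints "example"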
AUDIENCE: [INAUDIBLE]
*** VAN ROSSUM: OK.
So the question is, can the transformation tool
potentially be abused for other purposes.
I think it definitely can.
There's nothing that says you have to use it to
transform it to--
you don't have to use it to transform
Python 2.x to 3.x code.
I mean, you can make the input syntax whatever you want it,
and you can slightly alter the driver so that instead of
transformations, you just get error messages if you match
certain patterns.
That's an excellent idea, actually.
AUDIENCE: [INAUDIBLE]
*** VAN ROSSUM: I didn't get the last few words, but your
question is, did I consider some other string abstraction
that would not make it necessary to rely on the
indexing so much.
AUDIENCE: As an adjunct to [INAUDIBLE].
*** VAN ROSSUM: Oh, I see.
Your question is specifically could we have an additional
string class that has sort of a different model.
That's a reasonable question.
I hadn't really considered that.
I see it as a library issue.
I think I would encourage people to sort of write custom
string classes that might be more efficient for certain
situations.
And you can probably write them by--
you can implement them in Python by using a byte array
and a thin layer on top of that.
Or if you're really interested in super performance, you can,
of course, do it all in C. I mean, that's the beauty of an
extensible language.
It doesn't all have to be in the standard library.
In the back?
AUDIENCE: [INAUDIBLE]
*** VAN ROSSUM: Sorry.
Could you speak up?
It's getting noisy.
AUDIENCE: [INAUDIBLE]
*** VAN ROSSUM: OK.
So yes.
So the question is, there's going to be a long period
where library developers--
third party library developers especially--
will sort of be required to maintain a 2.6 and a 3.0
version of the same library.
Or maybe even going back to earlier versions than 2.6, is
the expectation that they limit themselves to code that
can be automatically transformed to 3.0.
Expectation is a strong word.
I would recommend that because I expect that that is the sort
of least painful way for library developers to go.
Of course, if you have an existing library that has
backwards compatibility requirements going back to
Python 2.2 or some time even before, it becomes gradually
harder to maintain your source-- that code in a form
that can still be transformed.
I mean, if you're in the lucky situation that you can
actually say, 2.6 is the oldest version of Python I
support, then at least you can use some of the 3.0 features
that will be backported to 2.6.
But I think the syntactic conversion approach will work.
I mean, there's no reason that the transformer couldn't
convert Python 2.2 code to 3.0.
It would just sort of--
the subset of Python 2.2 that actually is validly
transformable into 3.0 is slightly smaller.
I would recommend that.
I mean, the bigger nightmare is for developers who have
extension modules, because the C API is going to be--
it's going to be a rougher ride, unfortunately.
Well, if you all aren't exhausted, I certainly am.
So I thank you for staying all the way until the end.
[APPLAUSE]