Review: Mark Pilgrim's Dive Into Python 3
If you're an experienced programmer, you want to learn Python 3, and you don't have a lot of time to waste, skip this review and just go straight to the book.
By Ryan McGreal
Posted February 08, 2010 in Blog (Last Updated February 08, 2010)
1 TL;DR Summary
If you're an experienced programmer, want to learn Python 3, and don't have a lot of time to waste, skip this review and just go straight to Mark Pilgrim's Dive Into Python 3.
DIP3 starts with practical, working code, takes it apart piece by piece, puts it back together, and leaves you with a solid understanding of the concepts and their applications.
DIP3 is opinionated but well-informed. It hammers home the important stuff and skips the blather. It's also very well written: terse, witty, and expressive (like Python itself).
2 Introduction
Mark Pilgrim's Dive Into Python 3 came out late last year, but I didn't receive a review copy until late January - complete with autograph and a friendly note from the author apologizing for the delay. (Disclosure: I felt warm and fuzzy after receiving the personal note. Also, the review copy I received was free.)
If you're familiar with Mark Pilgrim, you'll know that he's an opinionated writer. DIP3 is no exception; this is Python 3 the way Mark Pilgrim wants you to understand it.
The good news is that Pilgrim is a reliable narrator. That is, he really knows his stuff; and his opinions, while strong, are deeply knowledgeable, ethically and philosophically consistent, and shared in a spirit of cooperation and stewardship.
3 Purpose and Relation to Dive Into Python
Comparisons to Pilgrim's original Python book, Dive Into Python (also published by Apress), are inevitable. The good news is that this edition improves on the original in most of the ways that matter. The book design is more elegant and stylish, with better use of white space and contrast, cleaner fonts (with one notable exception, about which more below), and clearer layout.
The book is written primarily for experienced programmers coming to Python for the first time, but Pilgrim recognizes that a lot of its readers will have a background in Python 2.
Since Python 3 breaks backward compatibility with the 2.x line, the book contains specially formatted bullet notes at points where version 3 breaks with 2.x; for instance, the merging of the int and long datatypes, or the fact that the / division operator now performs floating point division by default, rather than integer division.
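A minimal sketch of those two changes, runnable in any Python 3 interpreter. The variable names are only for illustration:

```python
# Python 3 has a single int type with arbitrary precision; there is no separate long.
big = 2 ** 100
print(type(big).__name__)

# "/" always performs true (floating point) division.
true_quotient = 11 / 2

# "//" is the explicit floor (integer) division operator.
floor_quotient = 11 // 2

print(true_quotient, floor_quotient)
```

In Python 2, 11 / 2 would have quietly returned 5, which is exactly the kind of difference the bullet notes flag.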
By focusing mainly on Python 3 itself and then highlighting the diffs from Python 2, Pilgrim generally gets the best of both worlds.
Still, there are a few ways in which the book isn't entirely sure whether it wants to be a resource for programmers learning Python for the first time, or for Python 2 programmers who want to upgrade. It would be more helpful for experienced Python 2 programmers if Pilgrim noted the diffs consistently rather than sporadically.
Another significant change from DIP is the way it uses programs to introduce the language. In the original book, the introductory program was an 11-line script to build an ODBC connection string from a dictionary. It demonstrated function declaration, datatypes, everything-is-an-object, code blocks (and significant whitespace), and the if __name__ == '__main__' trick.
The next chapter covered dictionaries, lists, tuples, declaring and returning variables, string formatting (old style, with % () notation), and basic list comprehensions. The program was dissected exhaustively over two chapters and concluded with an explicit acknowledgment:
The odbchelper.py program and its output should now make perfect sense.
In DIP3, by contrast, the introductory program spans four chapters - and there's never a discrete aha moment where you realize that you understand the program fully.
Still, overall the book is a commendable accomplishment: a concise, accessible, and above all fun introduction to a language also known for concision, accessibility and fun.
4 Why Python 3?
Python 3 has not yet enjoyed wide adoption among programmers - existing Python programmers or otherwise - and part of the reason is that the language is caught in the chicken-egg problem whereby the lack of ported libraries makes it less appealing for coders, while the lack of existing coders makes it less compelling for developers to port their libraries.
My hunch is that Dive Into Python 3 will carry us closer to the tipping point at which coders and library maintainers start to upgrade en masse.
For one thing, Python 2 programmers will come away from this book with a much clearer understanding of how many problems and pain points Python 3 solves. In addition, version 3 sweetens the deal with powerful new data structures that already make me envious when I go back to work on projects written in 2.x.
5 Diving In
If you've read programming books before, you'll know that the general format is to start with lots of theoretical background and history, then introduce the syntax, data types and common code blocks; and finally, after a few chapters, start putting together a program.
That was my original experience trying to learn Python a few years ago by reading through How to Think Like a Computer Scientist: Learning with Python by Allen Downey, Jeff Elkner and Chris Meyers.
It wasn't until I read Mark Pilgrim's original Dive Into Python that I really started to get an internalized sense of the language. He raised 'learn by doing' to the level of an art form, and it clicked with me in an immediate and sustained way.
It becomes clear that Dive Into Python 3 follows the same aggressively practical, hands-on approach as soon as you read the opening line of the introduction:
Welcome to Python 3. Let's dive in.
The introduction covers installing Python 3, and it's a significant improvement on the equivalent chapter in the original DIP. If you run Windows, Mac OSX or Debian/Ubuntu, the book takes you step by step through the installation process. (If you run a more exotic operating system, you can probably figure out how to install Python all by yourself.)
6 Basics: Functions, Datatypes and Comprehensions
Chapter 1 covers function declaration and arguments, docstrings, sys.path, objects (and the oft-repeated fact that everything in Python is an object), denoting code blocks through indentation (i.e. significant whitespace), exceptions, variable declarations, and the if __name__ == '__main__' trick.
These are the absolute basics, and it's noteworthy that Pilgrim includes an introduction to exceptions. He's an opinionated programmer, and he wants you to understand that exception handling is fundamental to good Python code, not an exotic extra to be mentioned in passing once you're already proficient.
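To give a flavour of how much ground a first program can cover, here is a small sketch loosely modelled on the kind of file-size script the chapter dissects. The function name, parameter and messages are my own illustration, not a quotation from the book:

```python
def approximate_size(size, kilobyte_is_1024_bytes=True):
    """Convert a file size in bytes to a human-readable string."""
    # Exceptions are part of the basics: reject bad input loudly.
    if size < 0:
        raise ValueError("number must be non-negative")

    multiple = 1024 if kilobyte_is_1024_bytes else 1000
    for suffix in ("KB", "MB", "GB", "TB"):
        size /= multiple
        if size < multiple:
            return "{0:.1f} {1}".format(size, suffix)

    raise ValueError("number too large")


# Runs only when the file is executed directly, not when it is imported.
if __name__ == "__main__":
    print(approximate_size(1000000000000, False))
    print(approximate_size(4096))
```

A docstring, an optional argument, indentation as block structure, two raised exceptions and the __main__ trick, all in under twenty lines.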
Chapter 2 dives into Python's datatypes: booleans and boolean contexts, numbers, type coercion, operations, fractions, lists, tuples (immutable lists), dictionaries, sets (unordered collections of unique items, with a literal syntax new in Python 3 that resembles the syntax for dictionaries), and the special None type.
Chapter 3 introduces comprehensions, a delightful set of syntactic sweeteners that allow you to map and filter an iterable collection using a terse one-liner:
newlist = [func(item) for item in oldlist if test(item)]
New in version 3, Python adds dictionary and set comprehensions to the list comprehensions it already supported.
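A quick sketch of all three kinds of comprehension side by side, using a made-up word list:

```python
words = ["dive", "into", "python", "dive"]

# List comprehension: filter and map in one expression.
long_words = [w.upper() for w in words if len(w) > 4]

# Dictionary comprehension: map each word to its length.
lengths = {w: len(w) for w in words}

# Set comprehension: duplicates collapse automatically.
initials = {w[0] for w in words}

print(long_words, lengths, initials)
```

The dictionary and set forms differ only in the braces and, for dictionaries, the key: value pair before the for.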
Aside: this is one of the places where Pilgrim doesn't specifically mention the difference between Python 2 and 3. I found myself going back to the Python 2 documentation for a sanity check just to make sure these structures weren't there and I just somehow missed them.
7 Bytes vs. Characters
Chapter 4 breaks form. Instead of diving straight into code, Pilgrim opens with three pages of exposition - the only such indulgence in the book, and with good reason. In possibly the most beautiful and simultaneously despairing chapter in the book - indeed, it achieves a level of pathos befitting a novel, let alone a technical manual - Pilgrim recounts the tragedy of text on an international data network. And it is a tragedy.
In what might be the most significant break from version 2, Python 3 consistently and explicitly distinguishes strings, which are sequences of Unicode characters, from sequences of bytes.
Because text stored on disk or sent over a network is a stream of bytes, it must be decoded using a character encoding that maps the bytes to specific characters.
In Python before version 3, the default encoding was ASCII, the American Standard Code for Information Interchange, a seven-bit encoding that handles all conventional English characters - the lowercase and uppercase letters, numerals, and punctuation symbols - plus various control characters including tabs, spaces, newlines and so on.
Of course, not everyone speaks or writes English, and many other languages use additional characters (like the French e-accent-aigu é or the German eszett ß) or even collections of characters with little or no overlap to English (like Japanese Kanji). For these languages, ASCII is wholly inadequate, and different languages have independently developed various encoding systems that often map the same character to different bytes.
The potential for chaos is huge, and indeed a number of incompatible encodings have already caused plenty of grief for people trying to process text on computers.
We've all seen web pages with jumbles of nonsense characters where you would expect, say, quotation marks to go. This is caused by a mismatch between the encoding used to produce the text (say, ANSI CP-1252 in Microsoft Word) and the encoding used to render the text later (say, ISO/IEC 8859-1).
Pilgrim sets the scene eloquently and then, with a seemingly innocuous "Enter Unicode", leads us on a harrowing emotional tennis match, swinging back and forth between optimistic proposed solutions - let's agree to put every single character into one big encoding! - and new problems - each character now takes up 4 bytes! - to follow-up solutions - use agreed-upon subsets! - to still more problems.
Now cry a lot because everything you thought you knew about strings is wrong, and there ain't no such thing as plain text.
Pilgrim doesn't offer simple solutions to the problems he raises, because they don't have simple solutions. Character encoding may be the Great Granddaddy of more-or-less insoluble technical problems. Its problems can't be solved, as such, but if you understand them well enough they can be addressed more or less safely.
But as Pilgrim points out with regard to the necessary shortcuts that programs take in mapping out solutions:
[It's] a good assumption right up until the moment that it's not.
String processing is further complicated by the matter of big-endian vs. little-endian byte ordering - which is only a problem when people try to share files from different computers, "perhaps on a worldwide web of some sort".
With all this in mind, you can't help but conclude, despairingly: Character encoding is hard!
But rather than just giving up and going shopping, Pilgrim leads you through the maelstrom and delivers you - shaken but intact - on the mostly-safe harbour of UTF-8. Not that UTF-8 isn't also beset with traps and gotchas.
Chapter 4 demonstrates formatting strings (using Python's powerful new string formatting syntax, first introduced in version 2.6), common string methods, string slicing (a string, after all, is a sequence of characters), the difference between strings and bytes (lots of gotchas exposed here), and encoding of source code (Python 2 files were ASCII by default, whereas Python 3 files are UTF-8 by default).
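A minimal sketch of the string/bytes split, using my own example word rather than anything from the book. The last step reproduces the jumble-of-nonsense effect described earlier by decoding with the wrong encoding:

```python
text = "caf\u00e9"                  # the four characters c, a, f, e-acute
data = text.encode("utf-8")         # encoding a str produces bytes

# The accented character takes two bytes in UTF-8, so the lengths differ.
summary = "{0} characters, {1} bytes".format(len(text), len(data))

# Slicing a string works on characters, not bytes.
middle = text[1:3]

# Decoding with the right encoding round-trips cleanly...
round_trip = data.decode("utf-8")

# ...while the wrong one yields two junk characters in place of the accent.
mojibake = data.decode("latin-1")

print(summary, middle, round_trip == text, len(mojibake))
```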
8 Digression on the Book's Text Formatting
This is as good a place as any to mention a formatting bug in this edition: the peppering of the text with artifacts - hollow vertically aligned rectangles - in place of special characters like em-dashes. For example:
The first line imports the humansize program as a module▯a chunk of code that you use interactively or from a larger Python program.
Given the attention Pilgrim gives to character encoding (I still remember the reddit thread on which he posted an early draft of that section for review and feedback - some of it unbearably pedantic), it seems bizarre that his own book would be bitten by an encoding gotcha!
I contacted Pilgrim to ask if he knew what happened but did not receive a response in time for publication. It may be related to the fact that he had to convert the book from HTML to MS Word before submitting it to the publisher.
9 Regular Expressions
Chapter 5 delves into regular expressions, a powerful and pragmatic DSL for solving real-world data extraction problems. Yet in what is turning into a common theme, regex is complicated and contains some gnarly gotchas and edge cases.
Pilgrim takes us through real-world use scenarios in much the same way that real workaday programmers approach them: by iterating through a sequence of progressive changes and enhancements, until we're satisfied that the solution is Good Enough.
One of the nicest touches of Python's support for regex is verbose mode, which ignores whitespace and allows inline comments so that the regex still makes sense when you have to look at it again in six months. Unless you're parsing text on a daily basis and know regex inside and out, this will be helpful to you.
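Here is a small sketch of verbose mode. The phone-number pattern and the sample input are hypothetical, chosen only to show the whitespace and inline comments:

```python
import re

phone_pattern = re.compile(r"""
    (\d{3})     # area code: three digits
    \D*         # optional separator: any run of non-digits
    (\d{3})     # trunk: three digits
    \D*         # optional separator
    (\d{4})     # remainder of the number: four digits
    $           # end of string
    """, re.VERBOSE)

match = phone_pattern.search("800-555-1212")
groups = match.groups()
print(groups)
```

Without re.VERBOSE the same pattern is the one-liner (\d{3})\D*(\d{3})\D*(\d{4})$, which is the version you won't want to decipher in six months.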
10 Iterating Through Iteration
Chapter 6 tackles closures and generators. While Python is not as purely functional as a Lisp or even Ruby, it still provides some powerful functional tools. Pilgrim takes a thorny problem - matching pluralization rules for a variety of English nouns - and combines regular expressions with generators (functions that yield their values lazily and iteratively) and meta-functions (functions that take functions as arguments and return functions).
Pilgrim combines these tools to build generators that act as closures and yield functions based on regex pattern matching. It's heady stuff, but by the time he's finished taking it apart and putting it back together, you can't help but understand it. Along the way, we learn about files and file-like objects, and the powerful with context (about which more below).
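To make that less abstract, here is a sketch in the spirit of the pluralization example. The rule table and function names are my own assumptions for illustration, and the rules live in a tuple rather than in a file as a fuller version might have them:

```python
import re


def build_match_and_apply_functions(pattern, search, replace):
    # Both inner functions are closures: they remember pattern, search
    # and replace after the outer function has returned.
    def matches_rule(word):
        return re.search(pattern, word)

    def apply_rule(word):
        return re.sub(search, replace, word)

    return matches_rule, apply_rule


# Hypothetical rule table: (pattern to test, text to find, replacement).
RULE_PATTERNS = (
    ("[sxz]$", "$", "es"),
    ("[^aeioudgkprt]h$", "$", "es"),
    ("[^aeiou]y$", "y$", "ies"),
    ("$", "$", "s"),
)


def rules():
    # A generator: each pair of functions is built lazily, only when asked for.
    for pattern, search, replace in RULE_PATTERNS:
        yield build_match_and_apply_functions(pattern, search, replace)


def plural(noun):
    for matches_rule, apply_rule in rules():
        if matches_rule(noun):
            return apply_rule(noun)


print(plural("box"), plural("coach"), plural("vacancy"), plural("dog"))
```

Because rules() is lazy, a noun that matches the first rule never causes the later rule functions to be built at all.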
Chapter 7 dives deeper, tackling the same problem using classes and iterators. Everything in Python is an object, with properties and methods. In this chapter, Pilgrim explains how you can define your own classes and instantiate them as objects - in this case, an iterator class.
He covers Python's special reserved word pass, which does nothing but is necessary for those occasions when Python syntactically requires something - what Pilgrim calls "a Python reserved word that just means, 'move along, nothing to see here'."
He covers Python's special __init__() method, distinguishing it semantically from the C++ constructor method; the special self argument in Python classes; instantiating classes; instance variables; passing in parameters and calling methods.
Then he shows how to create an iterator class and revisits the pluralization code from the previous chapter to produce the same results in a terser, more abstract form.
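A minimal iterator class, as a sketch: any class with __iter__() and __next__() methods will do. The Fibonacci example and the limit parameter name are my own illustration:

```python
class Fib:
    """Iterator that yields Fibonacci numbers up to a limit."""

    def __init__(self, limit):
        # __init__ runs after the instance already exists; it only sets it up.
        self.limit = limit

    def __iter__(self):
        # Called once at the start of iteration; resets state, returns the iterator.
        self.a = 0
        self.b = 1
        return self

    def __next__(self):
        # Called for each item; raising StopIteration ends the loop cleanly.
        fib = self.a
        if fib > self.limit:
            raise StopIteration
        self.a, self.b = self.b, self.a + self.b
        return fib


numbers = list(Fib(20))
print(numbers)
```

A for loop over Fib(20) calls these same two methods behind the scenes, which is what "iterators all the way down" comes to mean.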
Aside: when I first discovered Python, I was delighted to discover that it appeared to be lists all the way down. Pilgrim argues by way of contrast that it's actually "iterators all the way down."
Chapter 8 indulges in a bit of whimsy by way of introducing the itertools library in the context of an extremely terse (just 14 lines!) Alphametics solver. Pilgrim dives a little deeper into regular expressions, introducing the findall() method and highlighting another pattern matching gotcha (it doesn't return overlapping matches). He also shows how you can use the new set datatype to extract the unique items in a list.
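Both points fit in a few lines. This is a sketch using a classic alphametics puzzle as sample input:

```python
import re

puzzle = "SEND + MORE == MONEY"

# findall() returns every non-overlapping match as a list of strings.
words = re.findall("[A-Z]+", puzzle)

# A set keeps only the unique letters across all the words.
unique_letters = set("".join(words))

# The gotcha: "aba" occurs at offsets 0 and 2 of "ababa", but the two
# occurrences overlap, so only the first is reported.
overlap_demo = re.findall("aba", "ababa")

print(words, len(unique_letters), overlap_demo)
```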
We also learn how to make assertions and catch AssertionErrors with the terse, one-line syntax you may have come to expect from Python by now.
Speaking of which, Pilgrim revisits the generator paradigm with a one-line generator expression. If you prefer not to iterate, you can pass the generator right into a tuple(), list() or set() function.
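A short sketch of assertions and generator expressions together; the strings are arbitrary. Bear in mind that assert statements are skipped entirely when Python runs with the -O flag:

```python
letters = "SEND"

# A generator expression looks like a list comprehension in parentheses,
# but it produces values one at a time, on demand.
codes = (ord(c) for c in letters)
first = next(codes)          # consumes only the first value
rest = list(codes)           # consumes whatever is left

# Or hand a generator expression straight to tuple(), list() or set().
as_tuple = tuple(ord(c) for c in letters)
as_set = set(ord(c) for c in "MOM")

# An assertion is a one-liner; a failed one raises AssertionError.
assert len(as_set) == 2, "expected two distinct letters"
try:
    assert 1 + 1 == 3, "arithmetic is broken"
except AssertionError as err:
    message = str(err)

print(first, rest, as_tuple, message)
```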
An important paradigm of functional programming is the use of lazy evaluation, in which the values in an iterable object are calculated on the fly as needed rather than all at once.
Python comes equipped with the powerful itertools module, which includes functions for generating permutations, products, combinations, chains, groups, and more.
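A sampler of those itertools functions, as a sketch with made-up inputs. Each call returns a lazy iterator, so I wrap the results in list() or a dictionary only to show what comes out:

```python
import itertools

# Ordered pairs of distinct letters.
perms = list(itertools.permutations("abc", 2))

# Unordered pairs.
combos = list(itertools.combinations("abc", 2))

# Cartesian product of two iterables.
pairs = list(itertools.product("ab", "12"))

# Several iterables stitched end to end.
chained = list(itertools.chain("ab", "cd"))

# groupby() groups adjacent items, so sort by the same key first.
names = ["al", "bob", "amy", "dana"]
by_length = {
    length: list(group)
    for length, group in itertools.groupby(sorted(names, key=len), key=len)
}

# Laziness in action: count() is infinite, but islice() asks for only three values.
first_three = list(itertools.islice(itertools.count(1), 3))

print(len(perms), combos, pairs, chained, by_length, first_three)
```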
Along the same lines, Pilgrim shows how to treat a file as an iterable list (most things in Python are iterable), with all the methods and functions available to manipulate lists.
The chapter closes with a vintage Pilgrim dive when he iterates through the chain of problems / solutions / new problems related to Python's eval() function. eval() is powerful - and dangerous, since it can run arbitrary code that does anything Python can do (subprocess.getoutput('rm -rf') anyone?).
Say it with me: "eval() is evil!"
It turns out that there is no truly reliable, safe way to expose eval() to untrusted third parties.
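A harmless sketch of why the obvious fixes fall short. The expression below only imports the math module, but it stands in for anything an attacker might import instead:

```python
expression = "__import__('math').sqrt(16)"

# Passing empty namespaces does not help: Python quietly makes the
# built-in functions available, including __import__.
result = eval(expression, {}, {})

# Replacing __builtins__ with an empty dict blocks this particular trick...
try:
    eval(expression, {"__builtins__": {}}, {})
    blocked = False
except NameError:
    blocked = True

# ...but that is damage limitation, not safety. Untrusted input can still
# contain expressions that eat CPU and memory without importing anything.
print(result, blocked)
```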
11 Unit Testing
Chapter 9 introduces unit testing. If you unit test, you'll appreciate the value of a powerful, simple testing framework; if you don't unit test, you may well close this chapter a convert.
Python, of course, comes with a unittest module that lets you build simple, robust test suites by subclassing its TestCase class to prove, empirically, that your code does what it's supposed to do (at least on a micro level).
Using unittest and the code from Chapter 5 to generate Roman Numerals, Pilgrim takes you through testing, debugging and (in Chapter 10) safe refactoring.
His philosophy is simple and elegant:
- Write tests.
- Write code.
- Test code.
- Debug code.
- When all the tests pass, stop coding.
Over the course of two chapters, you'd be hard pressed not to come away persuaded that Pilgrim's onto something here.
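The workflow above can be sketched with a tiny Roman numeral converter. This is my own minimal version, not the book's code, and it runs the suite programmatically so that it behaves the same in a script or an interactive session:

```python
import unittest

ROMAN_NUMERALS = (
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
)


def to_roman(n):
    """Convert an integer from 1 to 3999 to a Roman numeral."""
    if not 0 < n < 4000:
        raise ValueError("number out of range (must be 1..3999)")
    result = []
    for value, numeral in ROMAN_NUMERALS:
        while n >= value:
            result.append(numeral)
            n -= value
    return "".join(result)


class ToRomanTest(unittest.TestCase):
    def test_known_values(self):
        known = ((1, "I"), (4, "IV"), (9, "IX"), (1990, "MCMXC"), (3999, "MMMCMXCIX"))
        for integer, numeral in known:
            self.assertEqual(to_roman(integer), numeral)

    def test_out_of_range(self):
        self.assertRaises(ValueError, to_roman, 0)
        self.assertRaises(ValueError, to_roman, 4000)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToRomanTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

In the test-first spirit, the two test methods would be written before to_roman() existed, and you would stop coding the moment both pass.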
12 File I/O
Chapter 11 dives deep into files. I/O is central to computing, and Pilgrim has touched on files and file-like objects several times so far. In this chapter, files are the main attraction.
In Python 3, the open() function takes an encoding argument so that Python knows how to decode the sequence of bytes that is a file:
Bytes are bytes; characters are an abstraction. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a "text file" from disk, how does Python convert that sequence of bytes into a sequence of characters?
If you don't specify an encoding, Python 3 uses your system's default encoding (on Windows, for example, the default is typically CP-1252). That might be the file's encoding, but it might not - assume at your peril! This is especially true given that Python is cross-platform. Code written on, say, a Linux machine that makes assumptions about file encoding might suddenly fail on a Windows machine.
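A sketch of the peril, using a temporary file and a sample sentence of my own. The file is written as UTF-8 and then read back twice, once with the right encoding and once with CP-1252:

```python
import os
import tempfile

text = "na\u00efve caf\u00e9\n"     # contains two accented characters

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")

    # Always say which encoding you are writing...
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # ...and which one you are reading.
    with open(path, encoding="utf-8") as f:
        right = f.read()

    # Reading the same bytes as CP-1252 raises no error; it just
    # silently produces the wrong characters.
    with open(path, encoding="cp1252") as f:
        wrong = f.read()

print(right == text, wrong == text)
```

The wrong read is the scenario you get for free when a UTF-8 file meets a CP-1252 default and no encoding argument.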
Pilgrim then dives into opened files as stream objects, with useful properties and methods, including readline(), seek() and tell(). Stream objects also behave like iterators, yielding a line at a time.
In addition to reading files, Python can write files and append new lines to files. But not all files are text files. Python can also open, read, write and append to binary files, byte by byte.
Further, Pilgrim explains how to use the io library to create and manipulate file-like objects that may not correspond to actual files on disk but still possess file-like properties and methods.
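A brief sketch of file-like objects from the io module, with invented contents. They answer to the same stream methods as a file opened from disk:

```python
import io

# An in-memory text stream: no file on disk, same interface.
stream = io.StringIO("first line\nsecond line\n")

first = stream.readline()       # read one line, newline included
bookmark = stream.tell()        # remember the current position
second = stream.readline()
stream.seek(bookmark)           # jump back to the bookmark
second_again = stream.readline()

# Streams are iterators too: rewind and loop over the lines.
stream.seek(0)
lines = [line.rstrip("\n") for line in stream]

# The binary counterpart deals in bytes rather than characters.
binary = io.BytesIO(b"\x89PNG")
header = binary.read(4)

print(first, lines, header)
```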
Of course, Python can also handle compressed files via the gzip module. (The standard Python library is phenomenal in its breadth and usefulness.)
Python is thoroughly cross-platform, but it allows you to use and redirect the stdin, stdout and stderr pipes indigenous to *nix systems. Pilgrim shows you how to hook into these pipes.
Pilgrim advocates using the with context as the safest way to handle files and file-like objects. He notes that Python 3.1 supports multiple contexts in a single with statement (if you have Python 3 installed, take a few minutes to upgrade to 3.1).
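A sketch of the with statement at work, including the two-contexts-in-one-statement form available from Python 3.1. The file names are placeholders inside a temporary directory, and the gzip module does the compressing:

```python
import gzip
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    plain_path = os.path.join(folder, "notes.txt")
    packed_path = os.path.join(folder, "notes.txt.gz")

    # The file is closed when the block ends, even if an exception is raised.
    with open(plain_path, "wb") as plain:
        plain.write("dive in\n".encode("utf-8"))

    # Python 3.1 and later: two contexts managed by a single with statement.
    with open(plain_path, "rb") as source, gzip.open(packed_path, "wb") as target:
        target.write(source.read())

    both_closed = source.closed and target.closed

    # Read the compressed copy back to check that nothing was lost.
    with gzip.open(packed_path, "rb") as packed:
        restored = packed.read().decode("utf-8")

print(both_closed, restored)
```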
13 Serialization via XML, Pickle and JSON
Chapter 12 wrestles XML. Yes, XML is a special flavour of hell, but you will have to deal with it from time to time, and Python has easy, powerful tools to help you. He sketches out a crash course on XML, concluding, "Now you know just enough XML to be dangerous."
The standard library offers xml.etree.ElementTree to parse, walk, search, and iterate (it really is iterators all the way down) through, create, and modify your XML object.
Pilgrim pauses to draw our attention to the lxml module, a third-party library that replicates the ElementTree API but extends it with more powerful search methods.
With matching APIs, your code can prefer the more powerful third party library but fall back on the standard library if the former is not installed.
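The fallback pattern is short enough to sketch in full. The XML snippet is invented for illustration, and the code runs whether or not lxml is installed:

```python
try:
    # Prefer the third-party library when it is available.
    from lxml import etree
except ImportError:
    # Otherwise fall back on the standard library's compatible API.
    import xml.etree.ElementTree as etree

document = (
    "<feed>"
    "<entry><title>Dive in</title></entry>"
    "<entry><title>Again</title></entry>"
    "</feed>"
)

# From here on the code neither knows nor cares which library it got.
feed = etree.fromstring(document)
titles = [title.text for title in feed.findall("entry/title")]
print(titles)
```

The catch is to stick to the shared subset of the API below the import; lxml-only features would break the fallback.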
Pilgrim also shows how you can dig into an XML object with bad syntax and recover its data using lxml.
Of course, XML is evil (albeit an often necessary evil), and Chapter 13 introduces other methods of serializing Python objects.
Sometimes you want to store more than just strings in your data persistence routine, and Python has (wait for it) powerful tools for serializing even complex objects and datatypes - like the standard pickle module, which has dump() and load() methods for storing objects, dicts, lists, and so on.
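A minimal pickle round trip, as a sketch. The dictionary is made up, and an in-memory io.BytesIO stream stands in for a file opened in binary mode. As with eval(), only unpickle data you trust:

```python
import io
import pickle

entry = {
    "title": "Dive in",
    "tags": ("python", "review"),   # a tuple
    "published": True,
    "internal_id": b"\xde\xad",     # raw bytes
}

# dump() writes to any binary file-like object.
buffer = io.BytesIO()
pickle.dump(entry, buffer)

# load() reads it back into an equal, but separate, object.
buffer.seek(0)
restored = pickle.load(buffer)

print(restored == entry, restored is entry, type(restored["tags"]).__name__)
```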
But pickle is only good for Python programs, and data often has to pass between different applications written in different languages. Fortunately, Python comes equipped to handle JSON, or JavaScript Object Notation, a data format developed by Douglas Crockford and inspired by JavaScript object literal syntax (in fact, JSON is valid JavaScript).
JSON can handle objects, lists and dicts, and the standard json library can handle JSON. You can also extend JSON with custom serializers for data types it doesn't natively support, like bytes.
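Here is a sketch of a custom serializer and its matching deserializer, using the json module's default and object_hook parameters. The "__class__" and "__value__" keys are a convention I made up for this example, not part of JSON or the json module:

```python
import json


def to_json(obj):
    # Called only for objects the json module cannot serialize by itself.
    if isinstance(obj, bytes):
        return {"__class__": "bytes", "__value__": list(obj)}
    raise TypeError(repr(obj) + " is not JSON serializable")


def from_json(mapping):
    # Called for every decoded JSON object; undo the wrapping where it applies.
    if mapping.get("__class__") == "bytes":
        return bytes(mapping["__value__"])
    return mapping


entry = {"title": "Dive in", "internal_id": b"\xde\xad", "tags": ("python", "review")}

encoded = json.dumps(entry, default=to_json)
decoded = json.loads(encoded, object_hook=from_json)

print(decoded["internal_id"], decoded["tags"])
```

Note that the bytes survive the round trip but the tuple comes back as a list, since JSON has only one kind of array.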
Pilgrim really knows how to hammer home the gotchas. In his umpteenth reminder to define the encoding of an opened file:
You'll forget! I forget sometimes! And everything will work right up until the moment that it fails, and then it will fail most spectacularly.
14 Take a REST
Chapter 14 takes the reader through creating RESTful, HTTP-based web services. (Pilgrim wisely decided not to update the chapter on SOAP web services from the original DIP.)
Python does HTTP via the urllib.request and http.client modules (standard library; the successors to Python 2's urllib and httplib), but Pilgrim recommends the third-party httplib2 module instead. It's more complete and cleaner than either. In particular, it supports HTTP caching, last-modified testing, ETags, compression, and 30x redirects.
Pilgrim notes:
urllib speaks HTTP like I speak Spanish - enough to get by in a jam but not enough to hold a conversation. HTTP is a conversation. It's time to upgrade to a library that speaks HTTP fluently.
Pilgrim shows how to fetch data from web resources without being rude or inefficient, by accepting compressed data, caching files locally, remembering permanent redirects, and so on.
He also demonstrates RESTful interactions with HTTP GET, POST, PUT and DELETE requests to a web API (in his example, identi.ca, an open source, Twitter-like microblogging service).
Important note: httplib2 returns bytes, not strings. Yes, the character demon rears its head again. You need to specify an encoding.
15 Porting and Packaging
I've mentioned that DIP3 doesn't always explicitly highlight the differences between Python 2 and 3. Chapter 15 comes as something of a remedy.
The bytes/characters dichotomy is paramount throughout this book, since it's a huge gotcha that will burn you sooner or later. (It's one of the reasons the Python development team decided to break compatibility to create Python 3 in the first place.)
It's not surprising that Pilgrim ported to Python a library (from Mozilla) that guesses the character encoding of byte sequences. In this chapter, Pilgrim walks the reader through the sometimes painful exercise of porting the library, called chardet, from Python 2 to Python 3.
While the book has noted differences between the versions along the way, it has not served as a systematic guide to the changes. As Chapter 15 shows, there are some important but subtle differences under the hood that will break your code and may be infuriating to track down.
Python includes the 2to3.py tool, which automates the automatable aspects of porting, and highlights those issues it couldn't handle.
Of course, every chapter does double- and triple-duty, and this chapter also introduces the structure and internal arrangement of multi-file libraries.
Speaking of which, Chapter 16 covers packaging Python libraries for distribution. If nothing else, Pilgrim gets an A+++++ Would Definitely Read Again for demystifying the distutils library and unlocking the PyPI online catalogue.
I've always found the documentation on packaging to be arcane and inaccessible, and this chapter is a much-appreciated remedy.
16 Conclusion
DIP3 is not suitable as an introduction to programming, since Pilgrim assumes you bring a background in programming concepts to the table; but it's perfect for busy programmers looking to gain mastery in Python 3.
Over the course of a witty, compelling narration, the book repeatedly hammers home both Python 3's most important strengths (terse syntax, rich datatypes, everything-is-an-object, abundant iteration, powerful libraries) and most dangerous pitfalls (bytes vs. characters).
Pilgrim wastes no more time than is strictly necessary on exposition. We program to solve problems, and DIP3 keeps problem solving - practical, real-world problems - in the driver's seat throughout.
Notwithstanding a few minor quibbles, I can heartily recommend this book to anyone who wants to tackle Python 3. Since reading it, I find myself thinking about my own modest Python applications - sooner or later I'm going to have to cut the cord and make the jump to Python 3.
After reading Dive Into Python 3, that point seems closer than ever.
17 Reference
Mark Pilgrim, Dive Into Python 3, Apress, 2009.
Text copyright 2009 by Mark Pilgrim. Licensed under the Creative Commons Attribution Share-Alike licence.
DIP3 is available for download in HTML or PDF; or you can clone the document repository:
you@localhost:~$ hg clone http://hg.diveintopython3.org/ diveintopython3