PyToc

PyToc generates a table of contents for an HTML document based on headings, with anchor links from the TOC to specific headings.

By Ryan McGreal.
Posted January 11, 2010 in Projects. (Last Updated April 15, 2010)

Contents

1 Introduction
2 Download
3 Requirements
4 Using PyToc
4.1 Input Properties
4.2 Methods
4.3 Output Properties
5 History
6 Need for Flexibility

1 Introduction

PyToc generates a table of contents for an HTML document based on headings, with anchor links from the TOC to specific headings.

2 Download

You can download the latest version of PyToc from its GitHub repository.

3 Requirements

4 Using PyToc

You can see the code in action on this very page!

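It's pretty simple to use. Download pytoc.py and save it somewhere on your Python import path (for example, the directory you run Python from).

Here's a demonstration. It's a Python 2 session; in Python 3, urllib.urlopen became urllib.request.urlopen.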

>>> import urllib
>>> import pytoc
>>> url = 'http://quandyfactory.com/projects/40/pytoc'
>>> page = urllib.urlopen(url)
>>> html = page.read()
>>> toc = pytoc.Toc(html_in=html)
>>> toc.make_toc()
True
>>> toc.html_toc # returns an HTML table of contents
>>> toc.html_out # returns the html with anchors and numbering in headings
>>> toc.toc_list # returns a list of tuples in the form (section number, title)
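
Since toc_list is described above as a list of (section number, title) tuples, it's easy to post-process. Here's a minimal sketch, written for Python 3, that renders such a list as an indented plain-text outline. The sample data and the format_outline helper are my own illustration, not part of PyToc, and the real toc_list may differ in detail.

```python
# Hypothetical sample data in the (section number, title) shape described above.
# The real toc.toc_list may format its numbers differently.
toc_list = [
    ("1", "Introduction"),
    ("2", "Download"),
    ("4", "Using PyToc"),
    ("4.1", "Input Properties"),
    ("4.2", "Methods"),
]


def format_outline(entries, indent="  "):
    """Render (section number, title) tuples as an indented text outline."""
    lines = []
    for number, title in entries:
        # "4.1" has one dot, so it sits one level below "4".
        depth = str(number).count(".")
        lines.append(f"{indent * depth}{number} {title}")
    return "\n".join(lines)


print(format_outline(toc_list))
```

The nesting depth is inferred from the number of dots in the section number, so this only works if the numbers follow the dotted style used on this page.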

4.1 Input Properties

The following are input properties you enter to generate the table of contents.

4.2 Methods

4.3 Output Properties

After calling the make_toc() method, the following output properties are populated with values.

That's it, really.

5 History

This library started out as one-off code to generate a table of contents for a long document that I'm converting from MS Word format over to HTML.

The existing Word document has had many contributors and editors over the years and the format is a shambles. The table of contents is a mess and the headings are all over the place. (Thanks, WYSIWYG.)

I converted the whole thing into Markdown, a simple plain-text formatting syntax that converts to clean, structural HTML.

I still wanted the final document to have a table of contents, but I didn't want to have to go through the bother of maintaining the thing - especially if I ever wanted to add a new section in the middle, which would require a re-numbering of all the subsequent sections and subsections.

I whipped up a simple parser that walked the document and dynamically generated a table of contents, plus anchors in the document so the section headings listed in the contents could link straight down to the sections themselves.
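
The numbering part of that idea can be sketched in a few lines. This is not PyToc's actual code, just one way to do it: keep a stack of counters, one per heading level, and derive both the section number and an anchor name from it. The number_headings function and the "section-..." anchor format are assumptions made up for this example.

```python
def number_headings(headings):
    """Number (level, title) pairs given in document order.

    Levels are relative (1 = top level). Returns a list of
    (section number, anchor, title) tuples.
    """
    counters = []  # counters[i] is the current count at level i + 1
    result = []
    for level, title in headings:
        # Grow the stack if we moved deeper, trim it if we moved back up.
        while len(counters) < level:
            counters.append(0)
        del counters[level:]
        counters[level - 1] += 1

        number = ".".join(str(n) for n in counters)
        anchor = "section-" + number.replace(".", "-")  # hypothetical anchor format
        result.append((number, anchor, title))
    return result


# Hypothetical headings; add one in the middle and every later number shifts by itself.
headings = [
    (1, "Introduction"),
    (1, "Using PyToc"),
    (2, "Input Properties"),
    (2, "Methods"),
    (1, "History"),
]

for number, anchor, title in number_headings(headings):
    print(number, anchor, title)
```

Because the numbers are computed on every run, inserting a new section never requires renumbering by hand. One caveat of this sketch: if a document skips a level (say, from level 1 straight to level 3), the missing level shows up as a 0 in the number.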

6 Need for Flexibility

It worked, but the code was brittle. It required the HTML formatting to be very strict, e.g. the following would work:

Some subheading title

but the following would not work:

Some subheading title

Documents generated using python-markdown2 would work pretty consistently, but documents generated using, say, python-markdown would not, since the latter produces messier HTML output.

Of course, with documents produced using other means, all bets were off.

Anyway, I decided that if this was going to be at all useful as a general-purpose tool, it needed to be more flexible and forgiving. Of course, if you're programming in Python, parsing HTML, and want to be flexible and forgiving, there's no better tool than the mighty BeautifulSoup parsing library.

So I re-wrote it using BeautifulSoup.
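
To illustrate the difference between strict and forgiving heading extraction, here's a rough Python 3 sketch. It is not PyToc's code, and it uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra installs. The strict pattern, the HeadingCollector class and the sample markup are all hypothetical.

```python
import re
from html.parser import HTMLParser

# A deliberately strict pattern: lowercase tag, no attributes, no line breaks.
STRICT_H2 = re.compile(r"<h2>(.*?)</h2>")


class HeadingCollector(HTMLParser):
    """Collect the text of h1-h6 elements, whatever their attributes or case."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._chunks = None  # text pieces of the heading being read, if any

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in self.HEADING_TAGS:
            self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._chunks is not None:
            # Collapse runs of whitespace, including line breaks.
            self.headings.append(" ".join("".join(self._chunks).split()))
            self._chunks = None


def collect_headings(html):
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


sample = '<h2>Plain heading</h2>\n<H2 class="sub" >Heading with\n  attributes</H2>'

print(STRICT_H2.findall(sample))   # the strict pattern only sees the first heading
print(collect_headings(sample))    # the parser-based version sees both
```

The point is only that a real parser copes with attributes, mixed case and stray whitespace that an exact-match approach trips over. BeautifulSoup goes further and also copes with badly nested or unclosed tags.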