PyToc

PyToc generates a table of contents for an HTML document based on headings, with anchor links from the TOC to specific headings.

By Ryan McGreal

Posted January 11, 2010 in Projects (Last Updated April 15, 2010)

Contents

1 Introduction
2 Download
3 Requirements
4 Using PyToc
4.1 Input Properties
4.2 Methods
4.3 Output Properties
5 History
6 Need for Flexibility

1 Introduction

PyToc generates a table of contents for an HTML document based on headings, with anchor links from the TOC to specific headings.

2 Download

You can download the latest version of PyToc from its GitHub repository.

3 Requirements

4 Using PyToc

You can see the code in action on this very page!

It's pretty simple to use. Download pytoc.py and save it somewhere on your Python path (for example, in the same directory as the script that imports it).

Here's a demonstration:

>>> import urllib
>>> import pytoc
>>> url = 'http://quandyfactory.com/projects/40/pytoc'
>>> page = urllib.urlopen(url)
>>> html = page.read()
>>> toc = pytoc.Toc(html_in=html)
>>> toc.make_toc()
True
>>> toc.html_toc # returns an HTML table of contents
>>> toc.html_out # returns the html with anchors and numbering in headings
>>> toc.toc_list # returns a list of tuples in the form (section number, title)

4.1 Input Properties

The following are input properties you enter to generate the table of contents.

  • html_in - The HTML document for which you want to generate a table of contents.

    This is the only necessary property to assign. The rest have default values that may meet your needs.

  • levels - A list of numbers corresponding to the heading levels you want to include in your TOC.

    E.g. [3, 4] would include <h3> and <h4> headings. Default is [3, 4].

  • id - The base id of the HTML table of contents to be generated. Default is "toc".

  • title - The title of the generated table of contents. Default is "Contents".

4.2 Methods

  • make_toc() - Generates the table of contents and populates the output properties. Returns True when complete.

4.3 Output Properties

After calling the make_toc() method, the following output properties are populated with values.

  • html_out - The same as html_in except with the TOC anchors and numbering included in the headings.

  • html_toc - The generated HTML table of contents.

  • toc_list - A list of tuples containing the anchors and headings, in case you would rather roll your own HTML table of contents.
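
If you do want to roll your own, here is a minimal sketch of what that could look like. It assumes toc_list holds (section number, title) tuples, as in the demonstration above. The render_toc function and the "toc_<number>" anchor scheme are made up for illustration; PyToc's real anchor ids may differ, so check html_out before relying on them. It is written for Python 3.

```python
from html import escape


def render_toc(toc_list, toc_id="toc", title="Contents"):
    """Build a flat HTML list from (section_number, title) tuples.

    Hypothetical helper, not part of PyToc.
    """
    lines = [
        '<div id="%s">' % escape(toc_id, quote=True),
        "<h2>%s</h2>" % escape(title),
        "<ul>",
    ]
    for number, heading in toc_list:
        # Assumed anchor scheme: "<toc_id>_<section number>".
        anchor = "%s_%s" % (toc_id, number)
        lines.append(
            '<li><a href="#%s">%s %s</a></li>'
            % (escape(anchor, quote=True), escape(str(number)), escape(heading))
        )
    lines.append("</ul>")
    lines.append("</div>")
    return "\n".join(lines)


# Sample data in the assumed (section number, title) shape.
sample = [("1", "Introduction"), ("2", "Download"), ("4.1", "Input Properties")]
toc_html = render_toc(sample)
```

Titles are escaped so that a heading containing "&" or "<" does not break the generated markup.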

That's it, really.

5 History

This library started out as one-off code to generate a table of contents for a long document that I'm converting from MS Word format over to HTML.

The existing Word document has had many contributors and editors over the years and the format is a shambles. The table of contents is a mess and the headings are all over the place. (Thanks, WYSIWYG.)

I converted the whole thing into Markdown, a simple plain-text formatting syntax that converts to clean, structural HTML.

I still wanted the final document to have a table of contents, but I didn't want to have to go through the bother of maintaining the thing - especially if I ever wanted to add a new section in the middle, which would require a re-numbering of all the subsequent sections and subsections.

I whipped up a simple parser that walked the document and dynamically generated a table of contents, plus anchors in the document so the section headings listed in the contents could link straight down to the sections themselves.
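
The renumbering part of that job can be sketched in a few lines. This is not PyToc's actual code, just an illustration of the idea: the number_headings function is hypothetical, and it assumes the headings have already been extracted as (level, title) pairs in document order.

```python
def number_headings(headings, levels=(3, 4)):
    """Return (section_number, title) pairs for (level, title) headings.

    Hypothetical helper, not part of PyToc.
    """
    counters = [0] * len(levels)
    numbered = []
    for level, title in headings:
        if level not in levels:
            continue  # this heading level is not part of the TOC
        depth = levels.index(level)
        counters[depth] += 1
        # Starting a new section resets every deeper counter.
        for deeper in range(depth + 1, len(counters)):
            counters[deeper] = 0
        number = ".".join(str(n) for n in counters[: depth + 1])
        numbered.append((number, title))
    return numbered


headings = [
    (3, "Introduction"),
    (3, "Using PyToc"),
    (4, "Input Properties"),
    (4, "Methods"),
    (3, "History"),
]
numbered = number_headings(headings)
```

Because the numbers are computed on every run, adding a new section in the middle shifts all the later numbers without any manual editing.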

6 Need for Flexibility

It worked, but the code was brittle. It required the HTML formatting to be very strict, e.g. the following would work:

Some subheading title

but the following would not work:

Some subheading title

Documents generated using python-markdown2 would work pretty consistently, but documents generated using, say, python-markdown would not, since the latter produces messier HTML output.

Of course, with documents produced using other means, all bets were off.

Anyway, I decided that if this was going to be at all useful as a general-purpose tool, it needed to be more flexible and forgiving. If you're programming in Python, parsing HTML, and want to be flexible and forgiving, there's no better tool than the mighty BeautifulSoup parsing library.

So I re-wrote it using BeautifulSoup.