mecker. mecker. mecker.

Summarizing (X)HTML in Python

Hey, my first post in English. It’s for search engine optimization only :P

I currently use this snippet generator for my post previews. In some situations it behaves incorrectly (when preserving whitespace in pre tags, to be precise), and it is really hard to maintain. See the source code.

This is my first time using the HTMLParser from the Python standard library. It was really straightforward to solve this without any side effects! Currently, it is slower than the ActiveState solution, which is about two times faster: summarizing "Wiki-Software: MindTouch Deki" (about 39 lines and 3483 characters) takes about 20 ms with the ActiveState recipe versus 40 ms with mine.
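To give an idea of why HTMLParser makes this easy, here is a minimal sketch of the three callbacks the listing below relies on. The class name EventLogger and the sample input are made up for illustration. The point is that the whitespace inside a pre tag arrives in handle_data untouched, so nothing special has to be done to preserve it.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class EventLogger(HTMLParser):
    """Hypothetical helper: records parser callbacks in the order they fire."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.events.append(("start", tag, attrs))

    def handle_data(self, data):
        # whitespace is handed over as is, including newlines and indentation
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed('<pre class="code">def f():\n    return 1</pre>')
logger.close()  # flush anything the parser is still buffering

for event in logger.events:
    print(event)
```

For longer input the parser may deliver one text node in several handle_data calls, so code that needs the full text should join the pieces.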

I have no plans to update this code listing. Please take a look at my personal Git repository for any changes: git clone http://git.posativ.org/posativ

#!/usr/bin/env python
"""
An awesome alternative to http://code.activestate.com/recipes/499336-summarizing-xhtml/
Released under the WTFPL
"""

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

class Summarizer(HTMLParser):

    def __init__(self, text, maxwords=100):
        HTMLParser.__init__(self)
        self.maxwords = maxwords
        self.summarized = ""
        self.words = 0
        self.stack = []

        self.feed(text)
        self.close()  # flush any text the parser is still buffering

    def handle_starttag(self, tag, attrs):
        """Emit each start tag and push it on the stack until maxwords is reached."""

        def tagify(tag, attrs):
            """Convert a parsed tag back into an HTML tag."""
            if attrs:
                return "<%s %s>" % (tag, " ".join(['%s="%s"' % (k, v) for k, v in attrs]))
            else:
                return "<%s>" % tag

        if self.words < self.maxwords:
            self.stack.append(tag)
            self.summarized += tagify(tag, attrs)

    def handle_data(self, data):

        if self.words >= self.maxwords:
            return

        words = data.split(" ")
        if self.words + len(words) < self.maxwords:
            # the next few words will not go over maxwords
            self.words += len(words)
            self.summarized += data
        else:
            # we can still add some words before we reach the word limit
            somewords = self.maxwords - self.words
            self.words += somewords
            self.summarized += " ".join(words[:somewords]) + " "
            # "abcdef" is a placeholder value for the ?p= parameter
            self.summarized += '... <a href="?p=%s" class="continue">continue</a>.' % "abcdef"

    def handle_endtag(self, tag):
        """As long as the maxwords limit is not reached, pop the tag pushed by
           handle_starttag and emit the end tag. Once the limit is reached,
           close every tag that is still open."""
        if self.words < self.maxwords:
            self.stack.pop()
            self.summarized += "</%s>" % tag
        else:
            # close the remaining open tags, innermost first
            while self.stack:
                self.summarized += "</%s>" % self.stack.pop()

if __name__ == "__main__":

    print(Summarizer("", 120).summarized)
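
The two ideas the listing above rests on are the word budget in handle_data and the stack of open tags in handle_endtag. As a rough sketch, they can be pulled out into plain functions and tried without any parser. The names take_words and close_open_tags are hypothetical and not part of the listing; the logic is meant to mirror it, not replace it.

```python
def take_words(data, used, maxwords):
    """Return (text to emit, new word count) for one chunk of text."""
    words = data.split(" ")
    if used + len(words) < maxwords:
        # the whole chunk fits into the remaining budget
        return data, used + len(words)
    # only part of the chunk fits: keep the first few words
    remaining = maxwords - used
    return " ".join(words[:remaining]) + " ", maxwords


def close_open_tags(stack):
    """Return end tags for every tag still open, innermost first."""
    return "".join("</%s>" % tag for tag in reversed(stack))


print(take_words("one two", 0, 3))
print(take_words("one two three four five", 0, 3))
print(take_words("a  b", 0, 10))
print(close_open_tags(["div", "p", "em"]))
```

The third call shows a quirk of splitting on a single space: the empty string between two consecutive spaces counts as a word, so text with runs of spaces uses up the budget faster than its visible word count suggests.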