Saving the HTML Code of Web Pages Automatically with Internet Explorer

Every now and then you want to save the HTML code of web pages for one purpose or another. You could of course display the source of every single page in the web browser and save the content to a separate file, or, even simpler, choose "Save" from the menu.

With many URLs/web pages in particular, though, it is more practical to let Python do the work. The following Python script automates Microsoft Internet Explorer and fetches the HTML code of a given URL fully automatically:

    # This example needs ActivePython or any other Python distribution
    # with the pywin32 module by Mark Hammond
    from win32com.client import Dispatch
    from time import sleep

    def download_url(url):
        """
        Note: IE internally formats all HTML to stupid mixed-case,
        no-quotes-around-attributes syntax. So if you are planning to parse
        the data, make sure you study the output of this function rather
        than looking at View-source alone.
        """
        ie = Dispatch("InternetExplorer.Application")
        ie.Visible = 1
        ie.Navigate(url)
        # it takes a little while for the page to load, so keep waiting
        # as long as IE reports that it is busy
        while ie.Busy:
            sleep(2)
        # now the page is loaded and the DOM is filled, so get the text
        text = ie.Document.body.innerHTML
        # text is unicode; drop all non-ASCII characters from it
        text = text.encode('ascii', 'ignore').decode('ascii')
        print(text)
        ie.Quit()
        return text

    download_url('http://www.heise.de')

Instead of printing the text, you could write the content straight to a file at that point.
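
Writing to a file could look roughly like the sketch below. It is only one possible approach: the helper names url_to_filename and save_html, the rule for deriving a file name from the URL, and the UTF-8 encoding are assumptions made for this example and are not part of the script above.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def url_to_filename(url):
    """Derive a simple file name from a URL (hypothetical helper)."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    # Replace every run of characters other than letters, digits,
    # dots and hyphens with a single underscore
    safe = re.sub(r"[^A-Za-z0-9.-]+", "_", raw).strip("_")
    return (safe or "page") + ".html"


def save_html(text, url, target_dir="."):
    """Write text to a file in target_dir and return the file's path."""
    target = Path(target_dir) / url_to_filename(url)
    # UTF-8 is an assumption; it keeps characters that an ASCII-only
    # file would lose
    target.write_text(text, encoding="utf-8")
    return target
```

Inside download_url, a call such as save_html(text, url) could then take the place of the print. For a whole list of URLs, a plain for loop that calls download_url for each entry should be enough.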
