save/convert web pages to a standalone editable html file for offline archive/view/edit/play/whatever
zTrix/webpage2html
This is a simple script to save a web page to a single html file. No mhtml or pdf stuff, no xxx_files directory, just one single readable and editable html file.
The basic idea is to insert all css/javascript files into html directly, and use base64 data URI for image data.
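The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script's actual code: the helper names `to_data_uri` and `inline_css` are hypothetical, and the MIME type is simply guessed from the file name.

```python
import base64
import mimetypes


def to_data_uri(data: bytes, filename: str) -> str:
    """Encode raw bytes as a base64 data URI (hypothetical helper).

    The MIME type is guessed from the file name; unknown types fall back
    to application/octet-stream.
    """
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return "data:%s;base64,%s" % (mime, encoded)


def inline_css(css_text: str) -> str:
    """Wrap stylesheet text in a <style> tag that can replace a <link> element."""
    return '<style type="text/css">\n%s\n</style>' % css_text


if __name__ == "__main__":
    # The first six bytes of a GIF file stand in for real image data.
    print(to_data_uri(b"GIF89a", "pixel.gif"))
    print(inline_css("body { margin: 0; }"))
```

The resulting data URI can be used as the `src` of an `<img>` tag, so the image travels inside the html file itself.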
Save web page directly from url (recommended way):
$ python webpage2html.py https://www.google.com > google.html
Or save the web page first using a browser such as Chrome, to something.html with a something_files directory beside it:
$ python webpage2html.py /path/to/something.html > something_single.html
But note that the second method may not always work as expected, because there may be urls like //ssl.gstatic.com/gb/images/v1_c69d5271.png (from the google index page) whose files are missing in the Google_files directory saved by the browser.
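The difference between the two methods can be illustrated with the standard library's `urljoin`. This is a minimal sketch, assuming Python 3; the `absolutize` helper is hypothetical and is not taken from the script. With a real page url as the base, a protocol-relative link resolves to something fetchable; with a local file path as the base, it stays without a scheme and cannot be downloaded.

```python
from urllib.parse import urljoin


def absolutize(base, link):
    """Resolve a link found in a page against the location the page came from."""
    return urljoin(base, link)


link = "//ssl.gstatic.com/gb/images/v1_c69d5271.png"

# Base is the original url: the link inherits the https scheme.
from_url = absolutize("https://www.google.com/", link)

# Base is a local file saved by the browser: no scheme is available to inherit.
from_file = absolutize("/path/to/something.html", link)

if __name__ == "__main__":
    print(from_url)
    print(from_file)
```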
Enable javascript with -s, for example, to save the 2048 game page into a single html file for offline playing:
$ python webpage2html.py -s http://gabrielecirulli.github.io/2048/ > 2048.html
Dependencies: BeautifulSoup4, lxml, termcolor (optional)
$ pip install -r requirements.txt
or install them manually
$ pip install lxml BeautifulSoup4 requests termcolor
I have tried the default HTMLParser and html5lib as the backend parser for BeautifulSoup, but both of them are buggy. HTMLParser handles self-closing tags (like <br> and <meta>) incorrectly: it waits for a closing tag for <br>, so if too many <br> tags exist in the html, BeautifulSoup complains RuntimeError: maximum recursion depth exceeded. html5lib encodes already-encoded html entities such as &lt; again to &amp;lt;, which is definitely unacceptable. I have tested many cases, and lxml works perfectly, so I chose to use lxml.
The termcolor package adds colored log output, if you like.
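To make the double-encoding problem concrete, here is a small sketch using only the standard library's `html` module. It does not reproduce html5lib's behavior; it only shows what happens to text when an already-encoded entity is escaped a second time.

```python
import html

# Text as it appears in the page source: the "<" is already encoded.
already_encoded = "1 &lt; 2"

# Escaping it again turns the "&" of the entity into "&amp;".
double_encoded = html.escape(already_encoded)

if __name__ == "__main__":
    print(double_encoded)                  # the entity is now escaped twice
    print(html.unescape(already_encoded))  # what a browser shows for the original
    print(html.unescape(double_encoded))   # what a browser shows after double encoding
```

A browser decodes entities only once, so the double-encoded version displays the literal text `&lt;` instead of the `<` character.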
The page below links Less stylesheets directly and uses less.js to compile them in the browser. In this case, I still cannot find a way to embed the Less code into the generated html and make it work.
<link rel="stylesheet/less" type="text/css" href="http://dghubble.com/blog/theme/css/style.less">
<script src="http://dghubble.com/blog/theme/js/less-1.5.0.min.js" type="text/javascript"></script>
- http://lesscss.org/#client-side-usage
- http://dghubble.com/blog/posts/.bashprofile-.profile-and-.bashrc-conventions/
Currently srcset is discarded.
- lukin.a.i submitted a patch to fix the issue of css links (rel=stylesheet) not being recognised
- Gruber.
- Java port of this project: https://github.com/cedricblondeau/webpage2html-java
- https://github.com/presto8
webpage2html uses the SATA License (Star And Thank Author License), so you have to star this project before using it. Read the license carefully.
