{"id":571,"date":"2009-02-09T02:15:42","date_gmt":"2009-02-09T01:15:42","guid":{"rendered":"http:\/\/www.thomasgericke.de\/v4\/interactive\/blog\/?p=571"},"modified":"2009-03-12T22:17:47","modified_gmt":"2009-03-12T21:17:47","slug":"howto-grab-and-thumbnail-websites","status":"publish","type":"post","link":"https:\/\/www.thomasgericke.de\/v4\/interactive\/blog\/2009\/02\/howto-grab-and-thumbnail-websites\/","title":{"rendered":"HOWTO grab and thumbnail websites"},"content":{"rendered":"<p>Hi there!<\/p>\n<p>Some of you asked how I implemented the grabbing and thumbnailing of whole websites (<a href=\"http:\/\/unfake.it\/XQh*\" target=\"_blank\">here&#8217;s an example<\/a>, and I wrote about it <a href=\"http:\/\/www.thomasgericke.de\/v4\/interactive\/blog\/2009\/02\/httpunfakeit-goes-magic\/\" target=\"_blank\">in this post<\/a>), so here is a brief HOWTO.<\/p>\n<p>Imagine you have a Linux system without a graphical environment. How do you render complex graphical content and take a screenshot of it? Here is the answer: <em><strong>grabbing websites on a Linux system is quite simple<\/strong><\/em>.<\/p>\n<p>Prerequisites:<\/p>\n<ol>\n<li>a Linux operating system (Debian is fine)<\/li>\n<li><code>khtml2png<\/code> (I used <code>khtml2png_2.7.6_i386.deb<\/code>\u00a0from <a href=\"http:\/\/sourceforge.net\/projects\/khtml2png\/\" target=\"_blank\">here<\/a>)<\/li>\n<li>a running X server (<code>Xvfb<\/code> does it for me)<\/li>\n<li><code>kdelibs4c2a<\/code><\/li>\n<li><code>libkonq4<\/code><\/li>\n<\/ol>\n<p>That&#8217;s it!<\/p>\n<p>The trick is this: on a machine that works as a server, you usually don&#8217;t want a real X server running. So, I just installed <code>Xvfb<\/code>, which is a &#8220;Virtual Framebuffer &#8216;fake&#8217; X server&#8221;. 
It runs in the background, and <code>khtml2png<\/code> uses its display.<\/p>\n<p>First, install <code>Xvfb<\/code> and several libs:<\/p>\n<pre class=\"brush: bash\">apt-get install xvfb kdelibs4c2a libkonq4<\/pre>\n<p>Confirm with &#8216;y&#8217; to resolve the dependencies.<\/p>\n<p>Now, get <code>khtml2png<\/code> from <a href=\"http:\/\/sourceforge.net\/projects\/khtml2png\/\">http:\/\/sourceforge.net\/projects\/khtml2png\/<\/a>\u00a0and install it:<\/p>\n<pre class=\"brush: bash\">dpkg -i khtml2png_2.7.6_i386.deb<\/pre>\n<p>Then, start your &#8216;fake&#8217; X server:<\/p>\n<pre class=\"brush: bash\">\/usr\/bin\/Xvfb :2 -screen 0 1920x1200x24<\/pre>\n<p>Of course, you can reduce the resolution to suit your needs. Just remember the display number (:2) you set for <code>Xvfb<\/code>, because <code>khtml2png<\/code> has to be pointed at the same display.<\/p>\n<p>And finally, you can use <code>khtml2png<\/code> to fetch any website you like:<\/p>\n<pre class=\"brush: bash\">\/usr\/bin\/khtml2png2 --display :2 --width 1024 --height 768 http:\/\/www.thomasgericke.de\/ \/tmp\/website.png<\/pre>\n<p>Don&#8217;t worry about the fact that the package is named <code>khtml2png<\/code> and the binary is called <code>khtml2png<strong>2<\/strong><\/code>. It&#8217;s okay!<\/p>\n<p>I have a little magical wrapper around that stuff which gets URLs out of a database and performs some checks. Images are saved with <code>wget<\/code> and converted to PNG; websites are fetched with <code>khtml2png<\/code>. 
Both are saved and thumbnailed on the fly with PHP.<\/p>\n<p>I call <code>khtml2png<\/code> via <code>cron<\/code> like this:<\/p>\n<pre class=\"brush: bash\">\/usr\/bin\/khtml2png2   --display :2 \\\n                      --width 1024 \\\n                      --height 768 \\\n                      --time 42 \\\n                      --disable-js \\\n                      --disable-java \\\n                      --disable-plugins \\\n                      --disable-redirect \\\n                      --disable-popupkiller \\\n                      http:\/\/www.thomasgericke.de\/ \\\n                      \/tmp\/website.png<\/pre>\n<p>My script is started every minute and checks whether new URLs have to be fetched. It also checks whether existing PNGs are older than 24 hours; if so, the URL is fetched again and the PNG is overwritten.<\/p>\n<p>Just let me know if you have any further questions.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hi there! 
Because some of you asked, how I realized the grabbing and thumbnailing of whole websites (here&#8217;s an example and I wrote about that in this post), this is a brief HOWTO. Imagine, you have a Linux system without graphical support. How do you display complex graphical content and make a screenshot? Here it [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"footnotes":""},"categories":[11],"tags":[121,61,101],"class_list":["post-571","post","type-post","status-publish","format-standard","hentry","category-tech","tag-html","tag-linux","tag-unix"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.thomasgericke.de\/v4\/interactive\/blog\/wp-json\/wp\/v2\/posts\/571","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.thomasgericke.de\/v4\/interactive\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.thomasgericke.de\/v4\/interactive\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.thomasgericke.de\/v4\/interactive\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.thomasgericke.de\/v4\/interactive\/blog\/wp-json\/wp\/v2\/comments?post=571"}],"version-history":[{"count":17,"href":"https:\/\/www.thomasgericke.de\/v4\/interactive\/blog\/wp-json\/wp\/v2\/posts\/571\/revisions"}],"predecessor-version":[{"id":721,"href":"https:\/\/www.thomasgericke.de\/v4\/interactive\/blog\/wp-json\/wp\/v2\/posts\/571\/revisions\/721"}],"wp:attachment":[{"href":"https:\/\/www.thomasgericke.de\/v4\/interactive\/blog\/wp-json\/wp\/v2\/media?parent=571"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.thomasgericke.de\/v4\/interactive\/blog\/wp-json\/
wp\/v2\/categories?post=571"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.thomasgericke.de\/v4\/interactive\/blog\/wp-json\/wp\/v2\/tags?post=571"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}