Hi there!

Since some of you asked how I grab and thumbnail whole websites (here’s an example, and I wrote about it in this post), here is a brief HOWTO.

Imagine you have a Linux system without graphical support. How do you render complex graphical content and take a screenshot of it? Here’s the answer: grabbing websites on a Linux system is quite simple.

Prerequisites:

  1. a Linux operating system (Debian is fine)
  2. khtml2png (I used khtml2png_2.7.6_i386.deb from here)
  3. a running X server (Xvfb does it for me)
  4. kdelibs4c2a
  5. libkonq4

This is it!

The trick is this: on a system working as a server, you usually don’t want a running X server. So I simply installed Xvfb, a “virtual framebuffer ‘fake’ X server”. It runs in the background, and khtml2png uses its display.

First, install Xvfb and several libs:

apt-get install xvfb kdelibs4c2a libkonq4

Hit ‘y’ to resolve the dependencies!

Now, get khtml2png from http://sourceforge.net/projects/khtml2png/ and install it:

dpkg -i khtml2png_2.7.6_i386.deb

Then, start your ‘fake’ X server:

/usr/bin/Xvfb :2 -screen 0 1920x1200x24

Of course, you may reduce the resolution to suit your needs. Just remember the display number (:2) you set for Xvfb.

And finally, you may use khtml2png to fetch any website you like:

/usr/bin/khtml2png2 --display :2 --width 1024 --height 768 http://www.thomasgericke.de/ /tmp/website.png

Don’t worry about the fact that the package is named khtml2png and the binary is called khtml2png2. It’s okay!

I have a little magical wrapper around all of this, which gets URLs out of a database and performs some checks. Images are saved with wget and converted to PNG; websites are fetched with khtml2png. Both are saved and thumbnailed on the fly with PHP.
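The wrapper itself isn’t published here, but a minimal sketch might look like the following. Note the assumptions: a plain text file of URLs stands in for the database, ImageMagick’s convert stands in for the PHP thumbnailing, and all paths, sizes, and file names are made up for the example.

```shell
#!/bin/sh
# Sketch of the wrapper: read URLs from a text file (stand-in for the
# database), grab each one with khtml2png2, then create a thumbnail.
# Paths and file names are assumptions, not the original script.

URLS=/etc/grab-urls.txt    # one URL per line
OUT=/var/www/shots

mkdir -p "$OUT"

# Derive a safe file name from a URL,
# e.g. http://www.thomasgericke.de/ -> www.thomasgericke.de
url_to_name() {
    echo "$1" | sed -e 's|^[a-z]*://||' -e 's|/$||' -e 's|[^A-Za-z0-9.-]|_|g'
}

while read -r url; do
    [ -z "$url" ] && continue
    name=$(url_to_name "$url")
    /usr/bin/khtml2png2 --display :2 --width 1024 --height 768 \
        "$url" "$OUT/$name.png"
    # Thumbnail with ImageMagick instead of PHP, just for the sketch
    convert "$OUT/$name.png" -resize 200x150 "$OUT/$name.thumb.png"
done < "$URLS"
```

Run from cron, this grabs every listed URL into one PNG plus one thumbnail per site.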

I call khtml2png via cron like this:

/usr/bin/khtml2png2   --display :2 \
                      --width 1024 \
                      --height 768 \
                      --time 42 \
                      --disable-js \
                      --disable-java \
                      --disable-plugins \
                      --disable-redirect \
                      --disable-popupkiller \
                      http://www.thomasgericke.de/ \
                      /tmp/website.png

My script is started every minute and checks whether new URLs have to be fetched. It also checks whether existing PNGs are older than 24 hours; if so, the URL is fetched again and the PNG overwritten.
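The 24-hour age check can be expressed with find’s -mmin test. This is only a sketch of that one step, not my actual script, and the output directory is an assumption:

```shell
#!/bin/sh
# List PNGs older than 24 hours (1440 minutes) — these are the ones the
# per-minute cron job would re-fetch and overwrite.
stale_pngs() {
    find "$1" -name '*.png' -mmin +1440
}

stale_pngs /var/www/shots
```

Each path printed by stale_pngs would then be mapped back to its URL (via the database) and passed to khtml2png2 again.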

Just let me know if you have any further questions.


  • Johann:

Hi there Thomas. Yours is one of only a handful of articles I could find on using khtml2png. I have it installed and running for most URLs. However, for some webpages, I encounter a DOM::DOMException from khtml2png2.

    /usr/local/bin/khtml2png2 --width 1024 --height 768 --time 42 --disable-js --disable-java --disable-plugins --disable-redirect --disable-popupkiller http://www.google.com/ google.png
    terminate called after throwing an instance of 'DOM::DOMException'
    KCrash: Application 'khtml2png2' crashing...

    I have tried this command with different parameters. The same exception is always thrown. yahoo.com also results in the same exception. I wonder if you’ve encountered this exception before, and what solutions you were able to find?

    Thanks for your attention.

  • @Johann:
    I’m having the same issues. Since I run a daemon which produces thumbnails almost on the fly, I rehash all required processes once in a while.

    I could not find a final solution yet.
