HOWTO grab and thumbnail websites

Hi there!

Because some of you asked, how I realized the grabbing and thumbnailing of whole websites (here’s an example and I wrote about that in this post), this is a brief HOWTO.

Imagine, you have a Linux system without graphical support. How do you display complex graphical content and make a screenshot? Here it comes: grabbing websites on a Linux system is quite simple.

Prerequisites:

  1. a Linux operating system (Debian is fine)
  2. khtml2png (I used khtml2png_2.7.6_i386.deb from here)
  3. a running X server (Xvfb does it for me)
  4. kdelibs4c2a
  5. libkonq4

This is it!

The trick now is: on a system working as a server, you usually don’t want to have a running X server. So, I just installed Xvfb, which is a “Virtual Framebuffer ‘fake’ X server”. It is running in the background and khtml2png uses its display.

First, install Xvfb and several libs:

apt-get install xvfb kdelibs4c2a libkonq4

Hit ‘y’ to solve dependencies!

Now, get khtml2png from http://sourceforge.net/projects/khtml2png/ and install it:

dpkg -i khtml2png_2.7.6_i386.deb

Then, start your ‘fake’ X server:

/usr/bin/Xvfb :2 -screen 0 1920x1200x24

Of course, you may reduce the resolution to your needs. But remember the display number (:2) you set for Xvfb.

And finally, you may use khtml2png to fetch any website you like:

/usr/bin/khtml2png2 --display :2 --width 1024 --height 768 http://www.thomasgericke.de/ /tmp/website.png

Don’t worry about the fact that the package is named khtml2png and the binary is called khtml2png2. It’s okay!

I have a little magical wrapper around that stuff which gets URLs out of a database and performs some checks. Images are save with wget and converted to PNG, websites are fetched with khtml2png. Both are saved and thumbnailed on-the-fly with PHP.

I call khtml2png via cron like this:

/usr/bin/khtml2png2   --display :2 \
                      --width 1024 \
                      --height 768 \
                      --time 42 \
                      --disable-js \
                      --disable-java \
                      --disable-plugins \
                      --disable-redirect \
                      --disable-popupkiller \
                      http://www.thomasgericke.de/ \
                      /tmp/website.png

My script is started every minute and checks if new URLs have to be fetched. It also checks if existing PNGs are older than 24 hours and, if so, the URL will be fetched and the PNG overwritten.

Just let me know, if you have any further questions.

You might also like:

About Thomas

Born 1976, German, IT-Specialist, interested in: Photography, Politics, Social Networking and more
This entry was posted in tech and tagged , , . Bookmark the permalink.

One Response to HOWTO grab and thumbnail websites

  1. New blog post: HOWTO grab and thumbnail websites http://tinyurl.com/d9eqh5

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>