Webserver plugin / extension: Export RedDot CMS content – Parse published pages with wget

A few days ago, Tom Black from the University of Arkansas asked on the Google Groups how to export content to Office-compatible files. Ideally the result would be XML, XHTML or something similar.

Web Solutions Management Server (yeah, RedDot CMS...) doesn’t have an appropriate export

He figured out that there seems to be no proper solution for this: the built-in export function that RedDot (or, yes, esteemed readers, the Web Solutions Management Server) provides doesn’t deliver exactly what he was looking for. So he had to build his own solution.

Time for DIY and Download the little helper

Tom decided to create a script on his own with his colleague Warner Skoch.
I asked whether I could publish this simple but ingenious thing, and here it is.
It is obviously not a plugin, but the same rules apply here.
It comes with absolutely no warranty and is published under a Creative Commons license. Download now and donate here.

How to use it – What you definitely need

To get the thing running, here is what you need to know:

  • You have to run the batch script from your Windows XP or Vista desktop with a network connection. You do not need to install it on your web server; you just need network access to the inetpub, htdocs or www folder where RedDot CMS/Web Solutions Management Server publishes all the files.
  • You need to download AND install ActivePerl (http://www.activestate.com/activeperl/). ActivePerl should be installed and run from your desktop as well. (There is nothing that needs to be installed or run on the web server.)
  • Configure it for your needs.
  • Execute it on your Windows XP or Vista desktop computer.

The script is configured for PHP files, but if you read the short readme.txt carefully, you will find that it can easily be changed to work with any web-related file extension.

RTFM – Here is the README.txt


This script uses the wget executable included in the BAT file (but it should work with Unix wget as well) to grab all PHP files from a given URL. To the server, the download process looks like someone browsing the site, so the PHP files are fully rendered. After grabbing the PHP files, a Perl script is run to extract the content portion.
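To make the download step a bit more tangible, here is a rough Python sketch of how such a wget call could be assembled. The real command line lives in contentGrabber.bat and is not reproduced here: the function name `build_wget_command` is made up for this example, and while `-r` (recursive) and `-A` (accept list) are standard wget options, the use of `-r` in the batch file is an assumption.

```python
def build_wget_command(url, extensions=("php",)):
    """Return the argument list for a recursive wget download.

    Hypothetical helper for illustration only; the actual
    contentGrabber.bat may use different or additional options.
    """
    # The README asks for the URL WITHOUT "http://", so strip it if present.
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break

    # Build the accept list in the ".php,.aspx" style shown in the README.
    accept = ",".join("." + ext.lstrip(".") for ext in extensions)

    # -r: follow links recursively (assumed), -A: only keep these suffixes.
    return ["wget", "-r", "-A", accept, "http://" + url]


if __name__ == "__main__":
    print(" ".join(build_wget_command("www.example.com", ["php", "aspx"])))
```

Passing the result to `subprocess.run` would start the download, provided wget is on your PATH.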

The content portion is recognized by an opening and a closing marker (explained in the Perl script itself), which you can change to whatever is easiest for you (some familiarity with regular expressions will be helpful if you’d like to change this). Everything is saved to one output file (which you can name whatever you want), with the individual entries separated by a line of dashes. This was written to easily grab the content portion of a RedDot webpage, but could be used for other CMSs as well.
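If you have never worked with marker-based extraction, the idea can be sketched in a few lines of Python. This is not the Perl code from the download, just an illustration of the technique; the marker strings are the ones used in the template example of this README, and the function name `extract_content` is invented for this sketch.

```python
import re

# Marker comments as used in the template example of the README.
BEGIN_MARKER = "<!-- beginningContent -->"
END_MARKER = "<!-- endingContent -->"


def extract_content(html, begin=BEGIN_MARKER, end=END_MARKER):
    """Return the text between the first begin/end marker pair, or None."""
    # re.escape keeps the markers literal; DOTALL lets "." span line breaks,
    # and the non-greedy "(.*?)" stops at the first closing marker.
    pattern = re.compile(re.escape(begin) + r"(.*?)" + re.escape(end), re.DOTALL)
    match = pattern.search(html)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    page = "<html><!-- beginningContent -->\n<p>your content</p>\n<!-- endingContent --></html>"
    print(extract_content(page))
```

Changing the markers only means passing different `begin` and `end` strings, which is roughly what editing the configuration area of the Perl script amounts to.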

Instructions for running:

  1. Edit your templates to have beginning and ending markers so that the script knows which content to pull from the site. Publish all pages after the template change.

    <!-- beginningContent -->
      <p>your content</p>
      <p>more content</p>
    <!-- endingContent -->
  2. To run, make sure ActivePerl is installed (http://www.activestate.com/activeperl/).
  3. The contentGrabber.bat file is pre-configured to grab PHP files. You may edit line 3 of the batch file to grab any files you wish: replace “php” with any file extension, or list multiple extensions separated by commas. Example: “-A.php,.aspx,.asp,.jpg,.png,.gif”. Save your changes.
  4. Edit the configuration area in the recursiveContentGrabber.pl file. Fill in the beginning and ending markers of your content.
  5. Run the contentGrabber.bat file.
  6. It will ask you for the URL of the page you would like to grab content from. Simply type in the URL WITHOUT “http://”.
  7. You will then be prompted for an output file name. You can name it anything you wish, but I suggest using an HTML file type (filename.html), since the script adds doctype information to make the output readable in Firefox or Word.
  8. Wget will proceed to grab all of the files. The recursiveContentGrabber.pl script will auto-run and parse all of the files and print out all content between the beginning and ending markers to an HTML file. You may open this file in Microsoft Word or OpenOffice for editing.
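The parsing half of these steps can be summed up in a short Python sketch. It is an illustration under assumptions, not the recursiveContentGrabber.pl script itself: the function names, the 60-character separator and the HTML5 doctype are stand-ins chosen for this example, since the README only speaks of "a line of dashes" and "doctype information".

```python
import os
import re

BEGIN_MARKER = "<!-- beginningContent -->"
END_MARKER = "<!-- endingContent -->"
SEPARATOR = "-" * 60  # assumed length; the README only says "a line of dashes"


def collect_content(root_dir, extensions=(".php",)):
    """Walk root_dir and return the marked content of every matching file."""
    pattern = re.compile(
        re.escape(BEGIN_MARKER) + r"(.*?)" + re.escape(END_MARKER), re.DOTALL
    )
    suffixes = tuple(extensions)  # str.endswith needs a tuple, not a list
    entries = []
    for folder, subdirs, files in os.walk(root_dir):
        subdirs.sort()  # visit sub-folders in a predictable order
        for name in sorted(files):
            if not name.lower().endswith(suffixes):
                continue
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                match = pattern.search(handle.read())
            if match:  # files without markers are skipped
                entries.append(match.group(1).strip())
    return entries


def build_output(entries):
    """Join all entries into one HTML document, separated by dashes."""
    body = ("\n<p>" + SEPARATOR + "</p>\n").join(entries)
    # The doctype below is a placeholder for whatever the real script writes.
    return "<!DOCTYPE html>\n<html><body>\n" + body + "\n</body></html>\n"


if __name__ == "__main__":
    # "www.example.com" stands for the folder wget created during download.
    with open("output.html", "w", encoding="utf-8") as out:
        out.write(build_output(collect_content("www.example.com")))
```

Keeping the collecting and the output building in separate functions makes it easy to swap the output format later, for example if you would rather have one file per page.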

Enjoy using it and thanks to Tom Black & Warner Skoch for providing this useful little helper!


About the author:

Markus Giesen is a Solutions Architect and RedDot CMS Consultant, formerly based in Germany. He travels around the world to find and offer solutions for a better world (in a very web-based sense) and has found a way to do this as part of a Melbourne-based online consultancy. On this blog Markus shares his personal (not his employer’s) thoughts and opinions on CMS and web development. In his spare time you will find him reading, snowboarding or travelling. Also, you should follow him on Twitter!

