Webserver plugin / extension: Export RedDot CMS content – Parse published pages with wget
A few days ago, Tom Black from the University of Arkansas asked on the Google Groups how to export content to Office-compatible files. Ideally the result would be XML, XHTML or something similar.
Web Solutions Management Server (yeah, RedDot CMS..) doesn’t have an appropriate Export
After figuring out that there seems to be no proper solution, because the built-in export function that RedDot (or, yes, esteemed readers, the Web Solutions Management Server) provides doesn’t deliver exactly what he was looking for, he had to build his own solution.
Time for DIY and Download the little helper
Tom decided to create a script on his own with his colleague Warner Skoch.
I asked whether it would be possible to publish this simple but ingenious thing, and here it is.
It is obviously not a plugin, but the same rules apply here.
It comes with absolutely no warranty and is published under a Creative Commons license. Download now and donate here.
How to use it – What you definitely need
To get the thing running you have to know that:
- you have to run the batch script from your Windows XP or Vista desktop with a network connection. You do not need to install it on your web server; you just need a network connection to the inetpub, htdocs or www folder where RedDot CMS/Web Solutions Management Server publishes all the files
- you need to download AND install ActivePerl (http://www.activestate.com/activeperl/). ActivePerl should be installed and run from your desktop as well. (There is nothing that needs to be installed or run on the web server.)
- configure it for your needs
- execute it on your Windows XP or Vista desktop computer
The script is configured for PHP files, but if you read the short readme.txt file carefully, you will find that it can easily be changed to work with any web-related file extension.
RTFM – Here is the README.txt
This script uses the wget executable included in the BAT file (but it should work with unix wget as well) to grab all php files from a given url. The process it uses as it grabs the php files appears to the server to be someone browsing the site, so the php files are fully rendered. After grabbing the php files, a perl script is run to grab the content portion.
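To make the crawl step more concrete, here is a minimal Python sketch of how such a wget call could be assembled. It is not the command from contentGrabber.bat, which is not reproduced here. The function name `build_wget_command`, the example host and the choice of `--no-parent` are assumptions for illustration; `--recursive`, `--no-parent` and `--accept` are standard wget options.

```python
def build_wget_command(site, extensions=("php",)):
    """Return an argument list for a recursive wget crawl.

    `site` is given WITHOUT the "http://" prefix; this sketch assumes the
    prefix is added for you, as the README's prompt suggests.
    """
    # Normalise "php" and ".php" to the same ".php" form for the accept list.
    accept = ",".join("." + ext.lstrip(".") for ext in extensions)
    return [
        "wget",
        "--recursive",       # follow links like a visitor browsing the site
        "--no-parent",       # assumption: stay below the starting URL
        "--accept", accept,  # keep only files with these extensions
        "http://" + site,
    ]


if __name__ == "__main__":
    # Hypothetical host, used only to show the resulting command line.
    print(" ".join(build_wget_command("www.example.edu/news", ["php", "aspx"])))
```

Because the pages are requested over HTTP rather than copied from disk, the server runs the PHP first, so the saved files contain the rendered HTML.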
The content portion is recognized by an opening and closing marker (explained in the perl script itself), which you can change to whatever is easiest for you (a familiarity with regular expressions will be helpful if you’d like to change this). Everything is saved to one output file (which you can name whatever you want), with different entries separated by a line of dashes. This was written to easily grab the content portion of a RedDot webpage, but could be used for other CMSs as well.
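The marker idea can be sketched in a few lines of Python. This is a stand-in for illustration, not the Perl code from recursiveContentGrabber.pl. The marker strings are the ones suggested in the instructions; the function names and the 40-character dash line are assumptions.

```python
import re

# Marker comments placed in the templates around the content area.
BEGIN_MARKER = "<!-- beginningContent -->"
END_MARKER = "<!-- endingContent -->"


def extract_content(html, begin=BEGIN_MARKER, end=END_MARKER):
    """Return every fragment found between a begin and an end marker."""
    # re.escape keeps the markers literal; DOTALL lets "." span line breaks,
    # and the non-greedy ".*?" stops at the first closing marker.
    pattern = re.compile(re.escape(begin) + r"(.*?)" + re.escape(end), re.DOTALL)
    return [fragment.strip() for fragment in pattern.findall(html)]


def join_entries(entries, separator="-" * 40):
    """Join the fragments into one text, separated by a line of dashes."""
    return ("\n" + separator + "\n").join(entries)


if __name__ == "__main__":
    page = "<html><!-- beginningContent --><p>Hello</p><!-- endingContent --></html>"
    print(join_entries(extract_content(page)))
```

If you change the markers, only the two constants need to be touched, because `re.escape` takes care of any characters that have a special meaning in regular expressions.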
Instructions for running:
- Edit your templates to have beginning and ending markers so that the script will understand what content you need pulled from the site. Publish all pages after the template change.
<!-- beginningContent -->
<!-- endingContent -->
- To run, make sure ActivePerl is installed (http://www.activestate.com/activeperl/).
- The contentGrabber.bat file is pre-configured to grab PHP files. You may edit the batch file to grab any files you wish by editing line 3. Replace “php” with any file extension you wish, or add multiple extensions to the list. Example: “-A.php,.aspx,.asp,.jpg,.png,.gif”. Save your changes.
- Edit the configuration area in the recursiveContentGrabber.pl file. Fill in the beginning and ending markers of your content.
- Run the contentGrabber.bat file.
- It will ask you for the url of the page you would like to grab content from. Simply type in the url WITHOUT “http://”.
- You will then be prompted for an output file name. You can name it anything you wish, but I suggest using an HTML file type (filename.html), since the script adds doctype information to make the output readable in Firefox or Word.
- Wget will proceed to grab all of the files. The recursiveContentGrabber.pl script will auto-run and parse all of the files and print out all content between the beginning and ending markers to an HTML file. You may open this file in Microsoft Word or OpenOffice for editing.
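The final step, turning the collected fragments into a file that a browser or Word will open, might look roughly like the Python sketch below. The exact doctype and separator the real script writes are not shown in the README, so the HTML5 doctype, the dash paragraph and the name `wrap_as_html` are assumptions.

```python
from html import escape


def wrap_as_html(entries, title="Exported content"):
    """Wrap already-extracted HTML fragments in a minimal HTML document."""
    # A paragraph of dashes between entries, mirroring the README's
    # "line of dashes"; the real script's separator may look different.
    separator = "\n<p>" + "-" * 40 + "</p>\n"
    body = separator.join(entries)
    return (
        "<!DOCTYPE html>\n"  # assumption: the original doctype is unknown
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"{body}\n"
        "</body>\n</html>\n"
    )


if __name__ == "__main__":
    print(wrap_as_html(["<p>First page</p>", "<p>Second page</p>"]))
```

Saving the returned string under a name such as filename.html gives a single document with all entries in sequence.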
Enjoy using it and thanks to Tom Black & Warner Skoch for providing this useful little helper!