The other day, some colleagues of mine were looking for programs that could convert PDF documents to HTML. A bit of googling leads to all kinds of solutions, some of them free (the most popular technique seems to be PDF to EPUB conversion), but most charge for the conversion. We tried both kinds, but the resulting HTML document was all messed up, a nightmare to handle even for a CSS expert. A good way to avoid this mess is to render each page of the document as an image tag in the HTML file.
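To make the target concrete, here is a minimal sketch of what "one image tag per page" looks like as a file. The function name `make_page_html` and the file names `page-1.png`, `page-2.png`, and so on are made up for illustration; they are not what any converter or viewer actually produces.

```shell
#!/bin/sh
# Sketch: write a bare HTML page with one <img> per page image.
# make_page_html and the page-N.png names are hypothetical.
make_page_html() {
    # $1 = number of pages, $2 = output file
    {
        echo '<html><body>'
        i=1
        while [ "$i" -le "$1" ]; do
            echo "<div><img src=\"page-$i.png\" alt=\"Page $i\"></div>"
            i=$((i + 1))
        done
        echo '</body></html>'
    } > "$2"
}

out=$(mktemp)
make_page_html 3 "$out"
cat "$out"
```

The layout stays intact because the browser only has to stack images, at the cost of losing selectable text.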
So get your Windows or Mac snipping tool out and start capturing the pages of the PDF. But wait, what's that? You've got 200 pages in your document! Well, good luck then. You could hire some interns to do that for you, or you could use my simple trick (NOTE: this works for Linux/Mac users. Windows users can follow along up to step 4, and then I wish them luck).
Let's do it in steps (my History teacher always told me to write all my answers in steps; it gets you more points):
-
Upload/attach the 200-page PDF file to a new Gmail message and send it to someone (like a friend named Dev Null) or save it as a draft.
-
Open the PDF attachment by clicking on 'View'.
-
Now, Google will render that PDF for you as HTML, with each page being an image. Zoom in (or out) to adjust the resolution of the images. A full zoom gives you an image more than 1600 pixels wide, which should suffice for most computer screen resolutions. Make sure all the pages have rendered properly (scroll all the way down to check).
[Screenshot: page while it's rendering]
[Screenshot: page fully rendered]
-
Save the whole web page from your browser menu (Save As, complete web page). This gives you an index.html file. You could stop here, since you now have the PDF as a web page. But you will probably want to remove Google's PDF-viewer controls, the iframes, and the duplicate image tags for each page (see the HTML source file).
-
Create a new shell script file (call it 'clean_html.sh') and in it add the following lines:
# Find the lines where the viewer toolbar ("chrome") and the thumbnail pane start.
l1=`grep -n 'id="chrome"' "$1" | cut -f1 -d:`
l2=`grep -n 'id="thumb-pane"' "$1" | cut -f1 -d:`
line="$l1,`expr $l1 + 3`"
line2="$l2,`expr $l2 + 4`"
# Delete both blocks, drop the highlight pane and hidden tags, swap <center> for <div>, let the body scroll.
sed -e "${line}d" -e "${line2}d" \
    -e 's/<div class="highlight-pane"><\/div>/ /g' -e 's/<[^>]*display: none;[^>]*>//g' \
    -e 's/<center/<div/g' -e 's/<\/center>/<\/div>/g' -e '/body {/,/}/ s/hidden;/auto;/' \
    "$1" > "$2"
Save the file.
-
Now we apply the trick, a single-line Unix command that cleans up almost everything you don't need and gives you plain-Jane HTML:
sh clean_html.sh index.html new_index.html
Open the new_index.html file in the browser and notice the difference.
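If you are curious what the individual sed expressions in a cleanup script like this do, here is a minimal sketch that runs the tag-level substitutions on a tiny made-up page. The sample markup is an assumption for illustration only; the HTML that Google's viewer actually saves will look different, and the line-range deletions for the toolbar and thumbnail pane are left out here.

```shell
#!/bin/sh
# Hypothetical stand-in for a saved viewer page; the real markup will differ.
sample=$(mktemp)
cleaned=$(mktemp)
cat > "$sample" <<'EOF'
<html><head><style>
body {
  overflow: hidden;
}
</style></head><body>
<center><img src="page1.png"></center>
<img src="page1.png" style="display: none;">
<div class="highlight-pane"></div>
</body></html>
EOF

# 1. replace the empty highlight pane with a space
# 2. remove any tag that carries "display: none;" (the hidden duplicate image)
# 3. and 4. turn <center> ... </center> into <div> ... </div>
# 5. inside the body { ... } rule, change "hidden;" to "auto;" so the page scrolls
sed -e 's/<div class="highlight-pane"><\/div>/ /g' \
    -e 's/<[^>]*display: none;[^>]*>//g' \
    -e 's/<center/<div/g' \
    -e 's/<\/center>/<\/div>/g' \
    -e '/body {/,/}/ s/hidden;/auto;/' \
    "$sample" > "$cleaned"

cat "$cleaned"
```

The body rule in the sample is spread over several lines on purpose: sed only starts looking for the closing `}` of the `/body {/,/}/` range on the line after the one that opened it, so a one-line rule would leave the range open further down the file.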
So that's it! You may need a few more CSS changes to give it the look you want, but they'll be minor and easy to handle.
Filed under: Tutorials