
A shell script for downloading Slideshare presentations

Posted in Tech-related by chuanliang on 2012/01/26

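    I previously summarized how to download files from Slideshare (see: "Ways to download slides from Slideshare.net"), but every step described there has to be done by hand, which is extremely user-unfriendly. Converting the swf files to pdf is the worst part: repeating those steps manually for a deck of several dozen pages is exhausting. So although the method works, even I hardly ever use it.

    Slideshare Downloader is a shell script that automatically downloads presentations which Slideshare does not offer for download and saves them as PDF documents. When I tried it on my Red Hat Enterprise Server 5.3 machine, however, I ran into a few problems.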

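    1. The swfdec package that Slideshare Downloader uses to decode swf files is troublesome to install on Red Hat Enterprise Server because it depends on a large pile of other packages. After fiddling with it for a long time without success, I gave up on swfdec.

        The swfdec project does not seem to have been updated since 2008, and its official wiki, http://swfdec.freedesktop.org, has not been maintained for a long time either. After logging in I found nothing but spam posts, and at first I thought I had come to the wrong place.

        swfdec installation guide

        swfdec source code download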

    2. Slideshare Downloader uses imagemagick's convert command to turn several png files directly into a single pdf file. In my test this did not seem to work. As I read the official imagemagick documentation, several png files cannot be merged directly into one pdf file with convert:

    However, some formats, such as JPEG and PNG, do not support more than one image per file, and in that case ImageMagick is forced to write each image as a separate file. (imagemagick adjoin help)

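    3. The Slideshare Downloader script was probably tested on Ubuntu or another Debian-family system. On Red Hat the syntax of some commands seems to cause problems.

       When matching with BASH_REMATCH, the pattern has to be wrapped in double quotes ("") on Red Hat: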

    if [[ "$DOCID" =~ "([a-z0-9-]+)$"  ]]
    then
        DOCID=${BASH_REMATCH[0]}
    else
        echo $DOCID
        exit 1
    fi
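
    As far as I know, bash 3.2 and later treat a quoted pattern on the right-hand side of =~ as a literal string, so the quoted form above may stop matching on newer systems. A minimal sketch of a form that should behave the same across versions keeps the pattern in a variable and expands it unquoted. The function name extract_doc_id and the sample input are made up for illustration and are not part of the original script.

```shell
#!/bin/bash
# Sketch: keep the regex in a variable and expand it unquoted, so that
# bash treats it as a regular expression rather than a literal string.
# extract_doc_id is a hypothetical helper, not part of the original script.
extract_doc_id() {
    local re='([a-z0-9-]+)$'
    if [[ "$1" =~ $re ]]; then
        # BASH_REMATCH[1] holds the text matched by the first group
        echo "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

extract_doc_id 'prefix/my-slides-120126-phpapp01'   # prints my-slides-120126-phpapp01
```

    The function returns a non-zero status when nothing matches, so a caller can decide whether to abort.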

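       The sort command on Red Hat has no -V option.

     For these reasons, and for my own convenience, I adjusted the Slideshare Downloader script. I tested it on a few documents and it seems to work. It requires swftools, pdftk and imagemagick to be installed.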
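
    Before running the script it can help to check that the required tools are on the PATH. The following is only a sketch: missing_tools is a hypothetical helper name, and I am assuming that swftools provides the swfrender binary and imagemagick provides convert, as used later in the script.

```shell
#!/bin/sh
# Sketch: print the name of every given command that is not on PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumption: swfrender comes from swftools, convert from imagemagick.
missing_tools wget swfrender convert pdftk
```

    An empty result means everything was found; any name printed is a package still to be installed.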

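    The rough approach:

    1. Use wget -q -O to obtain the real addresses of all the swf files that make up the document at the given URL, and download them.

    2. Use the swfrender command from swftools to convert the swf files to png files.

    3. Use imagemagick's convert +adjoin to convert each png file into its own single-page pdf file.

    4. Use pdftk to merge the pdf files into one pdf file.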
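
    The four steps above can be sketched as a dry run that only prints the commands instead of executing them. The helper name print_plan, the document id and the file names are assumptions for illustration. Unlike the script further down, this sketch converts each page in its own convert call rather than using +adjoin.

```shell
#!/bin/sh
# Sketch: print (do not run) the commands for the four steps.
# print_plan, the document id and the file names are illustrative only.
print_plan() {
    docid=$1
    slides=$2
    target=$3
    pdfs=""
    i=1
    while [ "$i" -le "$slides" ]; do
        echo "wget -q -O $i.swf http://cdn.slidesharecdn.com/$docid-slide-$i.swf"
        echo "swfrender $i.swf -o $i.png"
        echo "convert $i.png $i.pdf"
        pdfs="$pdfs $i.pdf"
        i=$((i + 1))
    done
    # One merge command at the end, with the pages listed in order
    echo "pdftk$pdfs cat output $target"
}

print_plan example-doc-id 2 slides.pdf
```

    Printing the plan first makes it easy to see which files each tool reads and writes before anything is downloaded.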

     When merging the pdf files into one, the pages have to be combined in page order, so I used sort -k1.3:

      PDFS=`ls *.pdf | sort -k1.3 `
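
     One caveat: -k1.3 starts the comparison at the third character of each name, so the result depends on how many digits the page number has. A numeric sort is a possible alternative when the file names begin with the page number. This is only a sketch, and sort_pages is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: order page files by their leading number.
# sort -n compares the numeric prefix, so 2.pdf sorts before 10.pdf.
# sort_pages is a hypothetical helper name.
sort_pages() {
    sort -n
}

printf '%s\n' 10.pdf 1.pdf 2.pdf 11.pdf | sort_pages
```

     With these four names the output order is 1.pdf, 2.pdf, 10.pdf, 11.pdf.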

   The adjusted code (tested only on Red Hat, and not optimized):

#!/bin/bash
# Author: Andrea Lazzarotto
# http://andrealazzarotto.com
# andrea.lazzarotto@gmail.com

# Slideshare Downloader
# This script takes a slideshare presentation URL as an argument and
# carves all the slides in flash format, then they are converted to
# PNG and finally merged as a PDF

# License:
# Copyright 2010-2011 Andrea Lazzarotto
# This script is licensed under the Gnu General Public License v3.0.
# You can obtain a copy of this license here: http://www.gnu.org/licenses/gpl.html

# Usage:
# slideshare-downloader.sh URL [SIZE]

#-----------------------------------------------
# Modify 7/08/2011 by giudinvx
# Email  giudinvx[at]gmail[dot]com
#-----------------------------------------------

validate_input() {
    # Performs a very basic check to see if the url is in the correct form
    URL=`echo "$1" | cut -d "#" -f 1 | cut -d "/" -f 1-5`
    DOMAIN=`echo "$URL" | cut -d "/" -f 3`
    CORRECT='www.slideshare.net'
    if [[ "$DOMAIN" != "$CORRECT" ]]; then
        echo "Provided URL is not valid."
        exit 1
    fi
    # Use the second argument as SIZE only if it consists of digits
    if echo -n "$2" | grep "^[0-9]*$" >/dev/null; then
        SIZE=$2
    else
        SIZE=2000
        echo "Size not defined or invalid... defaulting to 2000."
    fi
}

check_dependencies() {
    # Verifies if all binaries are present
    # (swfrender is part of swftools, convert is part of Imagemagick)
    DEP="wget sed seq swfrender convert pdftk"
    ERROR="0"
    for i in $DEP; do
        WHICH="`which $i 2>/dev/null`"
        if [[ "x$WHICH" == "x" ]]; then
            echo "Error: $i not found."
            ERROR="1"
        fi
    done
    if [ "$ERROR" -eq "1" ]; then
        echo "You need to install some packages."
        echo "Remember: this script requires Imagemagick, swftools and pdftk."
        exit 1
    fi
}

build_params() {
    # Gathers required information
    DOCSHORT=`echo "$1" | cut -d "/" -f 5`
    echo "Download of $DOCSHORT started."
    echo "Fetching information..."
    INFOPAGE=`wget -q -O - "$1"`
    DOCID=`echo "$INFOPAGE" | grep "doc=" | cut -d "=" -f 3 | cut -d "&" -f 1`
    # The pattern is kept in a variable and expanded unquoted, because
    # bash 3.2 and later match a quoted pattern as a literal string
    DOCID_RE='([a-z0-9-]+)$'
    if [[ "$DOCID" =~ $DOCID_RE ]]; then
        DOCID=${BASH_REMATCH[0]}
    else
        echo "$DOCID"
        exit 1
    fi
    SLIDES=`echo "$INFOPAGE" | grep "totalSlides" | head -n 1 | sed "s/.*totalSlides//g" | cut -d ":" -f 2 | cut -d "," -f 1`
    echo "Slides: $SLIDES"
    echo "Size: $SIZE"
}

create_env() {
    # Finds a suitable name for the destination directory and creates it
    DIR=$DOCSHORT
    if [ -e "$DIR" ]; then
        I="-1"
        OLD=$DIR
        while [ -e "$DIR" ]; do
            I=$(( $I + 1 ))
            DIR="$OLD.$I"
        done
    fi
    mkdir "$DIR"
}

fetch_slides() {
    # Downloads every slide of the document as a separate swf file
    for i in $( seq 1 $SLIDES ); do
        echo "Downloading slide $i"
        wget "http://cdn.slidesharecdn.com/${DOCID}-slide-${i}.swf" -q -O "$DIR/slide-${i}.swf"
    done
    echo "All slides downloaded."
}

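convert_slides() {
    # Renders each swf slide to a png named after its page number
    for i in $( seq 1 $SLIDES ); do
        echo "Converting slide $i"
        # Original swfdec call, replaced by swfrender from swftools:
        # swfdec-thumbnailer -s $SIZE $DIR/slide-$i.swf $DIR/slide-$i.png 2>/dev/null
        swfrender "$DIR/slide-$i.swf" -o "$DIR/$i.png" 2>/dev/null
    done
    echo "All slides converted."
}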

build_pdf() {
    cd "$DIR"
    # sort -n (instead of sort -k1.3) keeps numeric page order: 2.png before 10.png
    IMAGES=`ls *.png | sort -n`
    echo "Generating PDF..."
    # +adjoin writes one pdf per input image: 0.pdf, 1.pdf, ...
    convert $IMAGES +adjoin %d.pdf
    PDFS=`ls *.pdf | sort -n`
    pdftk $PDFS cat output "$DOCSHORT.pdf"
    cd ..
    echo "The PDF has been generated."
    echo "Find your presentation in: \"`pwd`/$DIR/$DOCSHORT.pdf\""
}

clean() {
    # Removes the intermediate swf and png files
    rm -rf "$DIR"/*.swf
    rm -rf "$DIR"/*.png
}

validate_input "$1" "$2"
check_dependencies
build_params "$URL"
create_env
fetch_slides
convert_slides
build_pdf
clean

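    The same logic can in fact be applied to online document libraries such as Baidu Wenku (百度文库) and Docin (豆丁). On these sites many documents require points to download but can be read online for free, and the sites all use Flash as the viewer, which is what makes a points-free downloader possible. On Windows there are of course graphical tools such as 冰点文库下载器 and 易读文库下载器. The principle is probably similar, as one can roughly tell from the two libraries SWFToImage.DLL and pdflib.dll in the 冰点文库下载器 directory.

    Anyone with the interest and energy could turn this kind of download function into a standalone online service, and other product features could grow out of it. There should be a decent market for it.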

 
