Scraping images returned by an AJAX page with PHP

Reposted

mob604756f145d3 2013-08-07 14:28:00

Tags: ajax, array, php, 2d, json. Category: Code Life

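Page to scrape: http://pic.hao123.com/

As you scroll down the page, more images are fetched dynamically through AJAX, so we need to look closely at the requests the page makes.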


The URL that the page loads asynchronously via AJAX is:

http://pic.hao123.com/screen/1?v=1375797699944&act=type

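Opening this URL directly in a browser returns only an empty array, []. Yet when the same request is inspected while debugging the http://pic.hao123.com/ page itself, it does return an array of images.

The response is a JSON array. The difference between the two cases suggests that the server checks the Referer header, so we use curl to set it to the expected value.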
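One way to check the Referer hypothesis is to request the same URL twice, once without and once with the header, and compare the bodies. The helper below is a minimal sketch; `fetchBody` is a hypothetical name that does not appear in the original script, and the site's behaviour may have changed since the article was written.

```php
<?php
// Hypothetical helper (not part of the original script):
// GET a URL with curl, optionally sending a Referer header.
function fetchBody(string $url, ?string $referer = null): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
    if ($referer !== null) {
        curl_setopt($ch, CURLOPT_REFERER, $referer); // pretend the request came from this page
    }
    $body = curl_exec($ch);
    // curl_exec returns false on failure; report that as null.
    return $body === false ? null : $body;
}
```

Calling `fetchBody($url)` and `fetchBody($url, 'http://pic.hao123.com/')` with the AJAX URL above and comparing the two results would show whether the Referer is really what the server looks at.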

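The v parameter in the request is just a changing number, apparently used to defeat caching; the sample value above looks like a millisecond timestamp. The script below simply sends time() for it.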
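The sample value of v (1375797699944) looks like a millisecond timestamp, so the request URL can be assembled as in the sketch below. `buildScreenUrl` is a hypothetical helper, and the assumption that the server accepts any changing value for v comes only from the behaviour described here. The code assumes a 64-bit PHP build so that the timestamp fits in an integer.

```php
<?php
// Hypothetical helper (not part of the original script):
// builds the paged AJAX URL for a given page number.
function buildScreenUrl(int $page, ?int $millis = null): string
{
    // Default to the current time in milliseconds, mirroring the sample value of v.
    if ($millis === null) {
        $millis = (int) round(microtime(true) * 1000);
    }
    // http_build_query takes care of encoding and joins the pairs with "&".
    $query = http_build_query(['v' => $millis, 'act' => 'type']);
    return 'http://pic.hao123.com/screen/' . $page . '?' . $query;
}

// Reproduces the URL observed in the browser:
echo buildScreenUrl(1, 1375797699944), "\n";
// http://pic.hao123.com/screen/1?v=1375797699944&act=type
```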


Each request returns an array of 30 elements. Every element is an object with a picurl_orig property, which holds the real URL of the full-size image.
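Extracting those URLs from one response can be sketched as follows. `extractImageUrls` is a hypothetical helper, and the sample payload is invented for illustration; the real response carries 30 items and more fields per item.

```php
<?php
// Hypothetical helper (not part of the original script):
// returns every picurl_orig found in one JSON response.
function extractImageUrls(string $json): array
{
    $items = json_decode($json);
    if (!is_array($items)) {
        return []; // invalid JSON, or not a JSON array
    }
    $urls = [];
    foreach ($items as $item) {
        // Skip elements that are not objects or lack the property.
        if (is_object($item) && isset($item->picurl_orig) && is_string($item->picurl_orig)) {
            $urls[] = $item->picurl_orig;
        }
    }
    return $urls;
}

// Invented sample payload, used only to exercise the function.
$sample = '[{"picurl_orig":"http://example.com/a.jpg"},{"picurl_orig":"http://example.com/b"},{"other":1}]';
print_r(extractImageUrls($sample));
```

An empty result covers both the empty array [] that the server sends without the right Referer and a response that is not valid JSON, so the caller can treat either case as "nothing to download".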

The complete code follows:

<?php
/*
Scrape the images from
http://pic.hao123.com/
The images are loaded dynamically via AJAX.
*/
function replaceBadChar($fileName)
{
    // Replace characters that are invalid in file names ( \ / : * ? " < > | ) with underscores
    return str_replace(array('\\','/',':','*','?','"','<','>','|'),'_',$fileName);
}


$dir="images/";
$startTime=microtime(true);
if(!file_exists($dir)) mkdir($dir,0777);
set_time_limit(0);
$i=1;
$j=1;
while($i<10)
{
    /*
    Without a Referer (not used):
    $url='http://pic.hao123.com/screen/'.$i.'?v='.time().'&act=type';
    $file=file_get_contents($url);
    */
    $url='http://pic.hao123.com/screen/'.$i.'?v='.time().'&act=type';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_REFERER, "http://pic.hao123.com/"); // send the Referer the server expects
    curl_setopt($ch, CURLOPT_HEADER, 0); // do not include response headers in the output
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    $out = curl_exec($ch);
    curl_close($ch);
    //print_r($out);
    $data=json_decode($out); // array of objects; null if the response is not valid JSON
    //print_r($data);
    //print_r($data[1]->picurl_orig);

    if($data)
    {
        foreach($data as $k=>$v)
        {

            $imgUrl=$v->picurl_orig;
            print($imgUrl."<br/>");

            // Build the local file name from the URL, replacing invalid characters.
            $newFile=$dir.replaceBadChar($imgUrl);

            // Download the image once; the body and the Content-Type come from the same request.
            // $_SERVER['CONTENT_TYPE'] cannot be used here (undefined index): it describes
            // the incoming request, not the remote image.
            $ch2=curl_init($imgUrl);
            curl_setopt($ch2, CURLOPT_RETURNTRANSFER, true);
            $img=curl_exec($ch2);
            $contentType=curl_getinfo($ch2,CURLINFO_CONTENT_TYPE); // e.g. image/jpeg
            curl_close($ch2);
            if($img===false) continue; // skip images that failed to download
            print($contentType."<br/>");

            // "image/jpeg" -> "jpeg"; drop any parameters such as "; charset=..."
            $ext=substr(trim(explode(';',(string)$contentType)[0]),6);

            // Some image URLs have no extension, e.g. http://img.hb.aicdn.com/0c6012d12407da6b2adafc4f02779cb013a63a011dc69-x2q4Fz
            $suffix=strtolower(strrchr($imgUrl,'.'));
            print $suffix."<br/>";
            if($suffix!='.jpeg' && $suffix!='.jpg' && $suffix!='.png' && $suffix!='.gif')
            {
                $newFile.='.'.$ext; // append the extension
            }

            file_put_contents($newFile,$img);
            $j++;


        }
    }
    else
    {
        // An empty or invalid response is treated as the end of the list.
        echo 'All images fetched, elapsed time (s): '.(microtime(true)-$startTime);
        die();
    }
    $i++;

}

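When downloading the images, some URLs have no file extension:

http://img.hb.aicdn.com/0c6012d12407da6b2adafc4f02779cb013a63a011dc69-x2q4F

while others do:

http://sxsx.jpg

For the URLs without an extension, we need to check the Content-Type of the HTTP response.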

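Use

$contentType=curl_getinfo($ch2,CURLINFO_CONTENT_TYPE);

to get the Content-Type. It returns a value such as

image/jpeg

so the extension can be obtained with

$ext=substr($contentType,6);

which strips the leading "image/" (6 characters).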
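A plain substr works for a value like image/jpeg, but a Content-Type can carry parameters (for example "; charset=...") or may not be an image at all. The sketch below is a slightly more defensive variant; both function names are hypothetical. Note that `hasKnownImageSuffix` inspects only the path part of the URL, which differs from the strrchr check used in the script.

```php
<?php
// Hypothetical helper: maps a Content-Type header value to a file extension.
// Returns null when the value is missing or is not an image type.
function extensionFromContentType(?string $contentType): ?string
{
    if ($contentType === null) {
        return null;
    }
    // Drop parameters such as "; charset=binary" and normalise the case.
    $mime = strtolower(trim(explode(';', $contentType)[0]));
    if (strpos($mime, 'image/') !== 0) {
        return null; // not an image, e.g. an HTML error page
    }
    $ext = substr($mime, strlen('image/'));
    return ($ext === '' || $ext === false) ? null : $ext;
}

// Hypothetical helper: true when the URL path already ends in a common image extension.
function hasKnownImageSuffix(string $url): bool
{
    $path = (string) parse_url($url, PHP_URL_PATH); // ignores the host and the query string
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return in_array($ext, ['jpeg', 'jpg', 'png', 'gif'], true);
}

var_dump(extensionFromContentType('image/png; charset=binary')); // string(3) "png"
var_dump(hasKnownImageSuffix('http://example.com/a.JPG?x=1'));    // bool(true)
```

With these two helpers, a download loop could append `extensionFromContentType($contentType)` to the file name only when `hasKnownImageSuffix($imgUrl)` is false and the returned extension is not null.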

Reference: http://www.php10086.com/2013/01/1278.html
