Apr 03
Author: 肖建彬 | Reposting is permitted, provided the repost credits the original source, the author, and this copyright notice with a hyperlink
URL: http://www.xiaojb.com/archives/it/baidu-spider-head.shtml

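Someone told me that when Baidu's spider crawls content it always sends a HEAD request first and then a GET, which increases server load. Today I tried sending a HEAD request to the Discuz! forum to see what comes back: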

[root@host203 etc]# telnet www.discuz.net 80
Trying 61.135.205.104...
Connected to www.discuz.chinacache.net (61.135.205.104).
Escape character is '^]'.
HEAD / HTTP/1.1
Host: www.discuz.net

HTTP/1.1 200 OK
Date: Tue, 03 Apr 2007 03:30:25 GMT
Server: Apache
X-Powered-By: PHP/5.2.0
Content-Type: text/html
Via: 1.1 AN-0003011041133540
Set-Cookie: cdb_sid=MoRYfP; expires=Tue, 10-Apr-2007 03:30:25 GMT; path=/; domain=.discuz.net
Set-Cookie: cdb_onlineusernum=7236; expires=Tue, 03-Apr-2007 03:35:25 GMT; path=/; domain=.discuz.net
Connection: close
Via: 1.1 AN-0003011041133546

Connection closed by foreign host.

I also tried sohu.com:

[root@host203 etc]# telnet www.sohu.com 80
Trying 61.135.150.93...
Connected to pagegrp7.a.sohu.com (61.135.150.93).
Escape character is '^]'.
HEAD / HTTP/1.1
Host: www.sohu.com

HTTP/1.0 200 OK
Date: Tue, 03 Apr 2007 03:26:13 GMT
Server: Apache/1.3.33 (Unix) mod_gzip/1.3.19.1a
Vary: Accept-Encoding,X-Up-Calling-Line-id,X-Source-ID,X-Up-Bearer-Type,x-huawei-nasip,x-wap-profile
Cache-Control: max-age=70
Expires: Tue, 03 Apr 2007 03:27:23 GMT
Last-Modified: Tue, 03 Apr 2007 03:11:00 GMT
ETag: "201c035-300bc-4611c5c4"
Accept-Ranges: bytes
Content-Length: 196796
Content-Type: text/html
Age: 49
X-Cache: HIT from 168020036.sohu.com
Via: 1.0 168020036.sohu.com:80 (squid/2.6.STABLE9)
Connection: close

Connection closed by foreign host.
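
The same check can be scripted instead of typed into telnet. Below is a minimal sketch using Python's standard http.client module. The helper names head_headers and has_validators are made up for illustration, and the dictionary it returns keeps only one value per header name, so repeated headers such as the two Via lines above would be collapsed.

```python
import http.client


def head_headers(host, path="/", port=80, timeout=10):
    """Send a HEAD request; return (status, {lowercased header name: value}).

    Hypothetical helper for illustration. http.client adds the Host header
    itself, so this matches the two lines typed into telnet above.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        # Header names are case-insensitive, so store them lowercased.
        # Note: a repeated header (e.g. two Via lines) keeps only the last value.
        return resp.status, {name.lower(): value for name, value in resp.getheaders()}
    finally:
        conn.close()


def has_validators(headers):
    """True if the response carries Last-Modified or ETag."""
    return "last-modified" in headers or "etag" in headers
```

Calling head_headers("www.sohu.com") makes a real network request, so it is left out of the block; passing the returned headers to has_validators shows whether the response contains either of the two headers.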

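A spider sends HEAD in order to get Last-Modified or ETag and use them to decide whether the page content has been updated. Dynamic pages generally send neither of these headers, so the HEAD response gives the spider nothing useful to go on. For that reason I added the following code at the top of include/common.inc.php in the Discuz! forum: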

// A HEAD response carries no body, so stop here before Discuz! does any further work.
if($_SERVER['REQUEST_METHOD'] == 'HEAD') {
        exit();
}
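
To make the reasoning about Last-Modified and ETag concrete, here is a rough Python sketch of how a crawler might use a HEAD response to decide whether a GET is needed. It is only an assumption about the general approach: Baidu's actual logic is not public, and the function name needs_refetch and the dictionary layout are hypothetical. The sample values are taken from the two telnet sessions above.

```python
def needs_refetch(cached, head_headers):
    """Return True if the page should be fetched again with GET.

    cached:       validators saved from the previous crawl
    head_headers: headers returned by the new HEAD request
    (Hypothetical helper; not Baidu's real algorithm.)
    """
    # Header names are case-insensitive, so normalize both sides.
    old = {name.lower(): value for name, value in cached.items()}
    new = {name.lower(): value for name, value in head_headers.items()}

    compared = False
    for name in ("etag", "last-modified"):
        if name in old and name in new:
            compared = True
            if old[name] != new[name]:
                return True  # validator changed, so the content changed
    # With nothing to compare, the HEAD told us nothing: GET is needed anyway.
    return not compared


# Static page (sohu.com): validators are present and unchanged.
sohu_cached = {
    "ETag": '"201c035-300bc-4611c5c4"',
    "Last-Modified": "Tue, 03 Apr 2007 03:11:00 GMT",
}
sohu_head = dict(sohu_cached)
print(needs_refetch(sohu_cached, sohu_head))  # False

# Dynamic page (Discuz!): no ETag or Last-Modified in the response.
discuz_head = {"Content-Type": "text/html", "X-Powered-By": "PHP/5.2.0"}
print(needs_refetch({}, discuz_head))  # True
```

For the dynamic page the HEAD request is pure overhead: the crawler has to send the GET no matter what, which is why answering HEAD as cheaply as possible makes sense there.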


3 Responses to “Suggestions for optimizing dynamic pages for the Baidu spider's HEAD requests”

  1. xinbin Says:

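    Besides Last-Modified and ETag, Baidu also looks at one more field: Content-Length. The reasoning is straightforward: if the size of the page has changed, the page itself has changed.
    I just captured the traffic for discuz.net and looked at it: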
    HTTP/1.1 200 OK
    Server: nginx/0.6.29
    Date: Sat, 12 Apr 2008 03:46:59 GMT
    Content-Type: text/html
    Transfer-Encoding: chunked
    Connection: keep-alive
    X-Powered-By: PHP/5.2.5
    Set-Cookie: dznet_sid=WfAtY2; expires=Sat, 19-Apr-2008 03:46:59 GMT; path=/; domain=.discuz.net

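    There is no Content-Length in it, whereas in practice most Apache+PHP setups do send Content-Length.
    I have been looking into HEAD recently as well and wrote a blog post about it: http://www.trac.net.cn/2008/04/baiduspider-head.html Happy to compare notes.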

  2. xjb Says:

    Content-Length is not an accurate indicator of whether a page has been updated, though.

  3. 签名 Says:

    I'm a beginner and know nothing about optimizing DISCUZ. Which code in the HEAD can be removed to help with optimization?
