Apr 03
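Author: 肖建彬 | Reposting is permitted, provided the repost links to the original article and includes the author information and this copyright notice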
URL: http://www.xiaojb.com/archives/it/baidu-spider-head.shtml
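Someone told me that Baidu's spider always issues a HEAD request before the GET when it fetches content, which increases server load. Today I sent a HEAD request to the Discuz! forum to see what comes back: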
[root@host203 etc]# telnet www.discuz.net 80
Trying 61.135.205.104...
Connected to www.discuz.chinacache.net (61.135.205.104).
Escape character is '^]'.
HEAD / HTTP/1.1
Host: www.discuz.net

HTTP/1.1 200 OK
Date: Tue, 03 Apr 2007 03:30:25 GMT
Server: Apache
X-Powered-By: PHP/5.2.0
Content-Type: text/html
Via: 1.1 AN-0003011041133540
Set-Cookie: cdb_sid=MoRYfP; expires=Tue, 10-Apr-2007 03:30:25 GMT; path=/; domain=.discuz.net
Set-Cookie: cdb_onlineusernum=7236; expires=Tue, 03-Apr-2007 03:35:25 GMT; path=/; domain=.discuz.net
Connection: close
Via: 1.1 AN-0003011041133546

Connection closed by foreign host.
Then I tried the same thing against sohu.com:
[root@host203 etc]# telnet www.sohu.com 80
Trying 61.135.150.93...
Connected to pagegrp7.a.sohu.com (61.135.150.93).
Escape character is '^]'.
HEAD / HTTP/1.1
Host: www.sohu.com

HTTP/1.0 200 OK
Date: Tue, 03 Apr 2007 03:26:13 GMT
Server: Apache/1.3.33 (Unix) mod_gzip/1.3.19.1a
Vary: Accept-Encoding,X-Up-Calling-Line-id,X-Source-ID,X-Up-Bearer-Type,x-huawei-nasip,x-wap-profile
Cache-Control: max-age=70
Expires: Tue, 03 Apr 2007 03:27:23 GMT
Last-Modified: Tue, 03 Apr 2007 03:11:00 GMT
ETag: "201c035-300bc-4611c5c4"
Accept-Ranges: bytes
Content-Length: 196796
Content-Type: text/html
Age: 49
X-Cache: HIT from 168020036.sohu.com
Via: 1.0 168020036.sohu.com:80 (squid/2.6.STABLE9)
Connection: close

Connection closed by foreign host.
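A spider sends HEAD in order to obtain Last-Modified or ETag and use them to decide whether the page content has been updated. Dynamic pages generally send neither header, so the HEAD response gives the spider nothing to go on. I therefore added the following code at the top of include/common.inc.php in the Discuz! forum: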
if ($_SERVER['REQUEST_METHOD'] == 'HEAD') {
    // Answer HEAD requests immediately, before any forum code runs.
    exit();
}
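To make the reasoning above concrete, here is a minimal Python sketch of the check a spider might run on a HEAD response. It is only an illustration, not Baidu's actual logic: the function names `parse_headers` and `validators` are hypothetical, and the two header blocks are abridged copies of the telnet output above.

```python
from typing import Dict, List

# Abridged header blocks copied from the two telnet sessions above.
DISCUZ_HEAD = """\
HTTP/1.1 200 OK
Date: Tue, 03 Apr 2007 03:30:25 GMT
Server: Apache
X-Powered-By: PHP/5.2.0
Content-Type: text/html
Connection: close
"""

SOHU_HEAD = """\
HTTP/1.0 200 OK
Date: Tue, 03 Apr 2007 03:26:13 GMT
Cache-Control: max-age=70
Last-Modified: Tue, 03 Apr 2007 03:11:00 GMT
ETag: "201c035-300bc-4611c5c4"
Content-Length: 196796
Content-Type: text/html
"""


def parse_headers(raw: str) -> Dict[str, List[str]]:
    """Parse a raw response head into {lowercased name: [values]}."""
    headers: Dict[str, List[str]] = {}
    lines = raw.strip().splitlines()
    for line in lines[1:]:  # lines[0] is the status line
        if not line.strip():
            break  # a blank line ends the header block
        name, _, value = line.partition(":")
        headers.setdefault(name.strip().lower(), []).append(value.strip())
    return headers


def validators(raw: str) -> Dict[str, str]:
    """Return the Last-Modified / ETag values present in a response head."""
    headers = parse_headers(raw)
    return {
        name: headers[name][0]
        for name in ("last-modified", "etag")
        if name in headers
    }


if __name__ == "__main__":
    print("discuz:", validators(DISCUZ_HEAD))
    print("sohu:  ", validators(SOHU_HEAD))
```

For the Discuz! response the result is empty, which is why the HEAD round trip tells the spider nothing there, while the static sohu.com page exposes both values.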
April 12th, 2008 at 11:53
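Besides Last-Modified and ETag, Baidu also pays attention to one more field: Content-Length. The reasoning is straightforward: if the size of a page has changed, then the page itself has changed.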
I just captured and analyzed the response from discuz.net:
HTTP/1.1 200 OK
Server: nginx/0.6.29
Date: Sat, 12 Apr 2008 03:46:59 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
X-Powered-By: PHP/5.2.5
Set-Cookie: dznet_sid=WfAtY2; expires=Sat, 19-Apr-2008 03:46:59 GMT; path=/; domain=.discuz.net
There is no Content-Length in it, whereas in practice most Apache + PHP setups do send this header.
I have been looking into HEAD recently as well and wrote a blog post about it: http://www.trac.net.cn/2008/04/baiduspider-head.html Happy to compare notes.
April 14th, 2008 at 09:41
Using Content-Length to tell whether a page has been updated isn't really accurate, though, is it?
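The size heuristic from the earlier comment and the objection to it can both be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the function names are hypothetical, the two page bodies are made up for the example, and nothing here is claimed to be how Baidu's spider really works.

```python
import hashlib
from typing import Dict, Optional


def content_length(headers: Dict[str, str]) -> Optional[int]:
    """Return Content-Length as an int, or None when the header is absent."""
    value = headers.get("content-length")
    return int(value) if value is not None else None


def changed_by_length(old: Dict[str, str], new: Dict[str, str]) -> Optional[bool]:
    """Size heuristic: True/False when both sizes are known, None otherwise."""
    old_len, new_len = content_length(old), content_length(new)
    if old_len is None or new_len is None:
        return None  # e.g. a chunked response that carries no Content-Length
    return old_len != new_len


def changed_by_digest(old_body: bytes, new_body: bytes) -> bool:
    """Compare the bodies themselves; this needs a full GET, not a HEAD."""
    return hashlib.sha256(old_body).digest() != hashlib.sha256(new_body).digest()


# Two made-up page bodies that differ in content but not in size.
OLD_BODY = b"<p>online users: 7236</p>"
NEW_BODY = b"<p>online users: 7301</p>"
OLD_HEADERS = {"content-length": str(len(OLD_BODY))}
NEW_HEADERS = {"content-length": str(len(NEW_BODY))}

# Headers shaped like the discuz.net capture above: chunked, no Content-Length.
CHUNKED_HEADERS = {"transfer-encoding": "chunked", "content-type": "text/html"}

if __name__ == "__main__":
    print(changed_by_length(OLD_HEADERS, NEW_HEADERS))      # size says "unchanged"
    print(changed_by_digest(OLD_BODY, NEW_BODY))            # content did change
    print(changed_by_length(OLD_HEADERS, CHUNKED_HEADERS))  # no signal at all
```

A differing size does imply a changed page, but an equal size proves nothing, and a chunked response offers no size to compare in the first place.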
July 16th, 2008 at 23:22
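I'm a beginner and know next to nothing about optimizing Discuz. Could you tell me which code in the HEAD can be removed to help with optimization?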