simspider: an open-source web crawler engine project

Anonymous user, 2015-02-09
Development technology: C/C++
Categories: application tools, web crawlers
License: LGPL

Project details

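simspider - web crawler engine

1. Introduction

simspider is a lightweight, cross-platform web crawler engine. It provides a set of C function interfaces for quickly building your own crawler application, and it also ships an executable crawler program that demonstrates how to use those interfaces. simspider depends only on the third-party library libcurl.

Platforms currently supported:
* UNIX/Linux
* WINDOWS

The simspider function interface is easy to use. The main flow is:
* Create the crawler engine environment
* Configure the crawler engine environment
* Recursively crawl all pages starting from the entry URL
* Destroy the crawler engine environment

Many optional settings are available for customizing the crawler engine environment, including but not limited to:
* Set the size of the request queue
* Set the collection of file extensions of interest
* Whether to allow an empty file extension
* Whether to allow crawling outside the current site
* Set the maximum recursion depth
* Set the HTTPS certificate file name
* Set the interval between fetches
* Set the maximum number of concurrent fetches

The simspider engine implements a flexible process framework. It offers a rich set of callback function pointers so that the application designer can add custom processing logic at any point in the crawl, including but not limited to:
* When building the HTTP request headers
* When building the HTTP request body (usually POST content)
* When the HTTP response headers are received
* When the HTTP response body (usually HTML) is received
(In the four callbacks above, the application designer can use a further group of simspider functions to obtain the referring URL, the current URL, the response code, the recursion depth, the CURL object, the HTTP buffers, and similar information.)
* When reviewing the done queue after the crawl has finished

2. My first crawler program

Building a crawler application on the simspider engine library is straightforward. Here is a simple example:

[code]
#include <stdio.h>
#include "libsimspider.h"

int main()
{
    struct SimSpiderEnv *penv = NULL;
    int                 nret = 0;

    /* Create the crawler engine environment */
    nret = InitSimSpiderEnv(&penv, NULL);
    if (nret)
    {
        printf("InitSimSpiderEnv failed[%d]\n", nret);
        return 1;
    }

    /* Recursively crawl all pages starting from the entry URL */
    nret = SimSpiderGo(penv, "", "https://localhost/");
    if (nret)
    {
        printf("SimSpiderGo failed[%d]\n", nret);
        /* Release the environment on the error path as well */
        CleanSimSpiderEnv(&penv);
        return 1;
    }

    /* Destroy the crawler engine environment */
    CleanSimSpiderEnv(&penv);

    return 0;
}
[/code]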

...

6. Running the bundled crawler demo

The package ships with a crawler, src/simspider.c. A sample run follows (a RedHat Enterprise Linux Server release 5.4 guest in VMWARE on a home desktop PC, crawling the curl manual pages served by Apache on WINDOWS XP outside the virtual machine):

[code]
$ time ./simspider 192.168.6.79 5
>>>[https://192.168.6.79/]
>>>[https://192.168.6.79/curl-config.html]
>>>[https://192.168.6.79/TheArtOfHttpScripting]
>>>[https://192.168.6.79/libcurl/index.html]
>>>[https://192.168.6.79/index.html]
>>>[https://192.168.6.79/libcurl/libcurl.html]
>>>[https://192.168.6.79/libcurl/libcurl-easy.html]
>>>[https://192.168.6.79/libcurl/libcurl-multi.html]
>>>[https://192.168.6.79/libcurl/libcurl-share.html]
>>>[https://192.168.6.79/libcurl/libcurl-errors.html]
>>>[https://192.168.6.79/curl.html]
>>>[https://192.168.6.79/libcurl/curl_easy_cleanup.html]
>>>[https://192.168.6.79/libcurl/curl_easy_duphandle.html]
>>>[https://192.168.6.79/libcurl/curl_easy_escape.html]
>>>[https://192.168.6.79/libcurl/curl_easy_getinfo.html]
>>>[https://192.168.6.79/libcurl/curl_easy_init.html]
>>>[https://192.168.6.79/libcurl/curl_easy_pause.html]
>>>[https://192.168.6.79/libcurl/curl_easy_perform.html]
>>>[https://192.168.6.79/libcurl/curl_easy_recv.html]
>>>[https://192.168.6.79/libcurl/curl_easy_reset.html]
>>>[https://192.168.6.79/libcurl/curl_easy_strerror.html]
>>>[https://192.168.6.79/libcurl/curl_easy_unescape.html]
>>>[https://192.168.6.79/libcurl/curl_escape.html]
>>>[https://192.168.6.79/libcurl/curl_formadd.html]
>>>[https://192.168.6.79/libcurl/curl_formfree.html]
>>>[https://192.168.6.79/libcurl/curl_formget.html]
>>>[https://192.168.6.79/libcurl/curl_free.html]
>>>[https://192.168.6.79/libcurl/curl_getenv.html]
>>>[https://192.168.6.79/libcurl/curl_easy_send.html]
>>>[https://192.168.6.79/libcurl/curl_global_cleanup.html]
>>>[https://192.168.6.79/libcurl/curl_global_init.html]
>>>[https://192.168.6.79/libcurl/curl_global_init_mem.html]
>>>[https://192.168.6.79/libcurl/curl_mprintf.html]
>>>[https://192.168.6.79/libcurl/curl_multi_add_handle.html]
>>>[https://192.168.6.79/libcurl/curl_multi_assign.html]
>>>[https://192.168.6.79/libcurl/curl_multi_cleanup.html]
>>>[https://192.168.6.79/libcurl/curl_multi_fdset.html]
>>>[https://192.168.6.79/libcurl/curl_getdate.html]
>>>[https://192.168.6.79/libcurl/curl_multi_info_read.html]
>>>[https://192.168.6.79/libcurl/curl_multi_init.html]
>>>[https://192.168.6.79/libcurl/curl_multi_perform.html]
>>>[https://192.168.6.79/libcurl/curl_multi_remove_handle.html]
>>>[https://192.168.6.79/libcurl/curl_multi_setopt.html]
>>>[https://192.168.6.79/libcurl/curl_multi_socket.html]
>>>[https://192.168.6.79/libcurl/curl_multi_socket_action.html]
>>>[https://192.168.6.79/libcurl/curl_multi_strerror.html]
>>>[https://192.168.6.79/libcurl/curl_multi_timeout.html]
>>>[https://192.168.6.79/libcurl/curl_share_cleanup.html]
>>>[https://192.168.6.79/libcurl/curl_share_init.html]
>>>[https://192.168.6.79/libcurl/curl_share_setopt.html]
>>>[https://192.168.6.79/libcurl/curl_share_strerror.html]
>>>[https://192.168.6.79/libcurl/curl_slist_append.html]
>>>[https://192.168.6.79/libcurl/curl_slist_free_all.html]
>>>[https://192.168.6.79/libcurl/curl_strequal.html]
>>>[https://192.168.6.79/libcurl/curl_unescape.html]
>>>[https://192.168.6.79/libcurl/curl_version.html]
>>>[https://192.168.6.79/libcurl/curl_version_info.html]
>>>[https://192.168.6.79/libcurl/]
>>>[https://192.168.6.79/libcurl/libcurl-tutorial.html]
>>>[https://192.168.6.79/libcurl/curl_easy_setopt.html]
[ 200][1][][https://192.168.6.79/]
[ 200][2][https://192.168.6.79/][https://192.168.6.79/TheArtOfHttpScripting]
[ 200][2][https://192.168.6.79/][https://192.168.6.79/curl-config.html]
[ 200][2][https://192.168.6.79/][https://192.168.6.79/curl.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/index.html]
[ 200][4][https://192.168.6.79/libcurl/curl_easy_getinfo.html][https://192.168.6.79/libcurl/]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_easy_cleanup.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_easy_duphandle.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_easy_escape.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_easy_getinfo.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_easy_init.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_easy_pause.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_easy_perform.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_easy_recv.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_easy_reset.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_easy_send.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_easy_setopt.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_easy_strerror.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_easy_unescape.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_escape.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_formadd.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_formfree.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_formget.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_free.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_getdate.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_getenv.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_global_cleanup.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_global_init.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_global_init_mem.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_mprintf.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_multi_add_handle.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_multi_assign.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_multi_cleanup.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_multi_fdset.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_multi_info_read.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_multi_init.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_multi_perform.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_multi_remove_handle.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_multi_setopt.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_multi_socket.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_multi_socket_action.html]
[ 404][4][https://192.168.6.79/libcurl/curl_multi_socket.html][https://192.168.6.79/libcurl/curl_multi_socket_all.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_multi_strerror.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_multi_timeout.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_share_cleanup.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_share_init.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_share_setopt.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_share_strerror.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_slist_append.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_slist_free_all.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_strequal.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_unescape.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_version.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/curl_version_info.html]
[ 200][2][https://192.168.6.79/][https://192.168.6.79/libcurl/index.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/libcurl-easy.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/libcurl-errors.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/libcurl-multi.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/libcurl-share.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/libcurl-tutorial.html]
[ 200][3][https://192.168.6.79/libcurl/index.html][https://192.168.6.79/libcurl/libcurl.html]

real    0m0.452s
user    0m0.062s
sys     0m0.360s
[/code]

7. Finally

Getting more tempted the more you read? Then download it and give it a try. If you have questions or suggestions, feel free to contact me ^_^

Project home page: https://git.oschina.net/calvinwilliams/simspider
Author email: calvinwilliams.c@gmail.com
