WebCollector-Python: An Open-Source Web Crawler Framework Based on Python

Anonymous user · February 11, 2019
Development technology: Python
Category: Application tools, Web crawlers
License: GPLv3

Project Details

WebCollector-Python

WebCollector-Python is a Python crawler framework (kernel) that requires no configuration and is easy to extend for secondary development. It provides a concise API, so a powerful crawler can be implemented with only a small amount of code.

WebCollector (Java version)

The Java version of WebCollector is more efficient than WebCollector-Python: https://github.com/CrawlScript/WebCollector

Installation

Install with pip:

pip install https://github.com/CrawlScript/WebCollector-Python/archive/master.zip
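After installing, a quick import check confirms that the package is usable. This is a minimal sketch that relies only on the class names used in the demos below; it performs no crawling.

# coding=utf-8
# minimal post-install sanity check; no crawling is performed
import webcollector as wc

# RamCrawler and RedisCrawler are the crawler base classes used in the examples below
print(wc.RamCrawler)
print(wc.RedisCrawler)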

Examples

Basic

demo_auto_news_crawler.py

demo_manual_news_crawler.py

Quick Start

Automatically detecting URLs

demo_auto_news_crawler.py:

# coding=utf-8

import webcollector as wc


class NewsCrawler(wc.RamCrawler):
    def __init__(self):
        super().__init__(auto_detect=True)
        self.num_threads = 10
        self.add_seed("https://github.blog/")
        self.add_regex("+https://github.blog/[0-9]+.*")
        self.add_regex("-.*#.*")  # do not detect urls that contain "#"

    def visit(self, page, detected):
        if page.match_url("https://github.blog/[0-9]+.*"):
            title = page.select("h1.lh-condensed")[0].text.strip()
            content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
            print("\nURL: ", page.url)
            print("TITLE: ", title)
            print("CONTENT: ", content[:50], "...")


crawler = NewsCrawler()
crawler.start(10)
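The visit callback is where extracted data ends up, so persisting results only requires replacing the print calls. The following is a hypothetical variation of the crawler above that appends each matched article to a JSON Lines file; the output path results.jsonl and the lock around file writes are assumptions for illustration, not part of WebCollector-Python.

# coding=utf-8
# hypothetical variation: persist extracted articles instead of printing them
import json
import threading

import webcollector as wc

write_lock = threading.Lock()  # visit() may be called from several crawler threads


class NewsToJsonlCrawler(wc.RamCrawler):
    def __init__(self):
        super().__init__(auto_detect=True)
        self.num_threads = 10
        self.add_seed("https://github.blog/")
        self.add_regex("+https://github.blog/[0-9]+.*")
        self.add_regex("-.*#.*")  # do not detect urls that contain "#"

    def visit(self, page, detected):
        if page.match_url("https://github.blog/[0-9]+.*"):
            title = page.select("h1.lh-condensed")[0].text.strip()
            content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
            record = {"url": page.url, "title": title, "content": content}
            # append one JSON object per article (results.jsonl is an assumed path)
            with write_lock:
                with open("results.jsonl", "a", encoding="utf-8") as f:
                    f.write(json.dumps(record, ensure_ascii=False) + "\n")


crawler = NewsToJsonlCrawler()
crawler.start(10)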

Manually detecting URLs

demo_manual_news_crawler.py:

# coding=utf-8

import webcollector as wc


class NewsCrawler(wc.RamCrawler):
    def __init__(self):
        super().__init__(auto_detect=False)
        self.num_threads = 10
        self.add_seed("https://github.blog/")

    def visit(self, page, detected):
        detected.extend(page.links("https://github.blog/[0-9]+.*"))

        if page.match_url("https://github.blog/[0-9]+.*"):
            title = page.select("h1.lh-condensed")[0].text.strip()
            content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
            print("\nURL: ", page.url)
            print("TITLE: ", title)
            print("CONTENT: ", content[:50], "...")


crawler = NewsCrawler()
crawler.start(10)
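With manual detection, the regex passed to page.links is the single point of control over which links are followed, so narrowing the crawl only requires tightening that pattern. The sketch below is an illustrative variation; the 2019-02 prefix is borrowed from the filter example in the next section and is not a requirement of the framework.

# coding=utf-8
# hypothetical variation: only detect and visit posts whose URL starts with 2019-02
import webcollector as wc


class FebruaryNewsCrawler(wc.RamCrawler):
    def __init__(self):
        super().__init__(auto_detect=False)
        self.num_threads = 10
        self.add_seed("https://github.blog/")

    def visit(self, page, detected):
        # a tighter regex here means fewer detected URLs
        detected.extend(page.links("https://github.blog/2019-02.*"))

        if page.match_url("https://github.blog/2019-02.*"):
            title = page.select("h1.lh-condensed")[0].text.strip()
            print(page.url, "->", title)


crawler = FebruaryNewsCrawler()
crawler.start(10)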

Filtering detected URLs with the detected_filter plugin

demo_detected_filter.py:

# coding=utf-8

import webcollector as wc
from webcollector.filter import Filter
import re


class RegexDetectedFilter(Filter):
    def filter(self, crawl_datum):
        if re.fullmatch("https://github.blog/2019-02.*", crawl_datum.url):
            return crawl_datum
        else:
            print("filtered by detected_filter: {}".format(crawl_datum.brief_info()))
            return None


class NewsCrawler(wc.RamCrawler):
    def __init__(self):
        super().__init__(auto_detect=True, detected_filter=RegexDetectedFilter())
        self.num_threads = 10
        self.add_seed("https://github.blog/")

    def visit(self, page, detected):
        detected.extend(page.links("https://github.blog/[0-9]+.*"))

        if page.match_url("https://github.blog/[0-9]+.*"):
            title = page.select("h1.lh-condensed")[0].text.strip()
            content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
            print("\nURL: ", page.url)
            print("TITLE: ", title)
            print("CONTENT: ", content[:50], "...")


crawler = NewsCrawler()
crawler.start(10)
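A detected_filter can encode any acceptance rule, not only a regex: it receives each detected crawl_datum and returns it to keep it, or None to drop it. The sketch below is a hypothetical filter that stops accepting new URLs once a fixed budget is reached; the limit of 100 is an arbitrary illustrative value, not a framework parameter.

# coding=utf-8
# hypothetical detected_filter: accept at most `limit` detected URLs in total
import threading

from webcollector.filter import Filter


class BudgetDetectedFilter(Filter):
    def __init__(self, limit=100):
        super().__init__()
        self.limit = limit               # illustrative budget
        self.count = 0
        self.lock = threading.Lock()     # detection may happen on several threads

    def filter(self, crawl_datum):
        with self.lock:
            if self.count >= self.limit:
                return None              # drop everything beyond the budget
            self.count += 1
            return crawl_datum

It is wired up the same way as in the demo above, e.g. super().__init__(auto_detect=True, detected_filter=BudgetDetectedFilter(limit=100)).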

Resumable crawling with RedisCrawler (the crawl can be resumed after shutdown)

demo_redis_crawler.py:

# coding=utf-8

from redis import StrictRedis

import webcollector as wc


class NewsCrawler(wc.RedisCrawler):
    def __init__(self):
        super().__init__(redis_client=StrictRedis("127.0.0.1"),
                         db_prefix="news",
                         auto_detect=True)
        self.num_threads = 10
        self.resumable = True  # you can resume crawling after shutdown
        self.add_seed("https://github.blog/")
        self.add_regex("+https://github.blog/[0-9]+.*")
        self.add_regex("-.*#.*")  # do not detect urls that contain "#"

    def visit(self, page, detected):
        if page.match_url("https://github.blog/[0-9]+.*"):
            title = page.select("h1.lh-condensed")[0].text.strip()
            content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
            print("\nURL: ", page.url)
            print("TITLE: ", title)
            print("CONTENT: ", content[:50], "...")


crawler = NewsCrawler()
crawler.start(10)
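RedisCrawler keeps the crawl state in the Redis server passed via redis_client (here 127.0.0.1), which is what makes the crawl resumable after a shutdown. A small optional pre-flight check, using only redis-py, can verify that the server is reachable before starting; this sketch is an assumption about deployment practice, not part of the framework.

# coding=utf-8
# optional pre-flight check: is the Redis server used by RedisCrawler reachable?
from redis import StrictRedis
from redis.exceptions import ConnectionError as RedisConnectionError

client = StrictRedis("127.0.0.1")
try:
    client.ping()  # raises a connection error if the server is not reachable
    print("Redis is reachable; crawl state can be persisted under the 'news' prefix.")
except RedisConnectionError:
    print("Redis is not reachable at 127.0.0.1; start a Redis server before crawling.")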

Customizing HTTP requests with Requests

demo_custom_http_request.py:

# coding=utf-8

import webcollector as wc
from webcollector.model import Page
from webcollector.plugin.net import HttpRequester
import requests


class MyRequester(HttpRequester):
    def get_response(self, crawl_datum):
        # custom http request
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
        }

        print("sending request with MyRequester")

        # send request and get response
        response = requests.get(crawl_datum.url, headers=headers)

        # update code
        crawl_datum.code = response.status_code

        # wrap http response as a Page object
        page = Page(crawl_datum,
                    response.content,
                    content_type=response.headers["Content-Type"],
                    http_charset=response.encoding)
        return page


class NewsCrawler(wc.RamCrawler):
    def __init__(self):
        super().__init__(auto_detect=True)
        self.num_threads = 10

        # set requester to enable MyRequester
        self.requester = MyRequester()

        self.add_seed("https://github.blog/")
        self.add_regex("+https://github.blog/[0-9]+.*")
        self.add_regex("-.*#.*")  # do not detect urls that contain "#"

    def visit(self, page, detected):
        if page.match_url("https://github.blog/[0-9]+.*"):
            title = page.select("h1.lh-condensed")[0].text.strip()
            content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
            print("\nURL: ", page.url)
            print("TITLE: ", title)
            print("CONTENT: ", content[:50], "...")


crawler = NewsCrawler()
crawler.start(10)
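Because get_response issues the request with the requests library directly, any of its options can be applied inside a custom requester. The sketch below is a hypothetical variant of MyRequester that adds a timeout and routes traffic through a proxy; the proxy address 127.0.0.1:8888 is a placeholder assumption.

# coding=utf-8
# hypothetical requester: same pattern as MyRequester, plus a timeout and a proxy
import requests

from webcollector.model import Page
from webcollector.plugin.net import HttpRequester


class ProxyRequester(HttpRequester):
    def get_response(self, crawl_datum):
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
        proxies = {  # placeholder proxy address
            "http": "http://127.0.0.1:8888",
            "https": "http://127.0.0.1:8888",
        }

        # a 10-second timeout keeps a stalled request from blocking a crawler thread forever
        response = requests.get(crawl_datum.url, headers=headers, proxies=proxies, timeout=10)

        crawl_datum.code = response.status_code
        return Page(crawl_datum,
                    response.content,
                    content_type=response.headers["Content-Type"],
                    http_charset=response.encoding)

It is enabled exactly as in the demo above, by assigning self.requester = ProxyRequester() in the crawler's __init__.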
