Overview: This project is designed to automate web scraping and make it easy. It takes the URL or HTML content of a web page, together with a list of sample data we want to scrape from that page. The sample data can be text, a URL, or any HTML tag value on the page. The library learns the scraping rules and returns similar elements. The learned object can then be used with new URLs to fetch similar content, or the exact same elements, from those pages. This means no more manually parsing web pages and writing extraction rules by hand.
GitHub repository:
https://github.com/alirezamika/autoscraper
Installation
This library works with Python 3.
Install from source:
python setup.py install
Install with pip from the git repository:
pip install git+https://github.com/alirezamika/autoscraper.git
Usage
Getting similar results
from autoscraper import AutoScraper
url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'
# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["How to call an external command?"]
scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)
Output:
[
'How do I merge two dictionaries in a single expression in Python (taking union of dictionaries)?',
'How to call an external command?',
'What are metaclasses in Python?',
'Does Python have a ternary conditional operator?',
'How do you remove duplicates from a list whilst preserving order?',
'Convert bytes to a string',
'How to get line count of a large file cheaply in Python?',
"Does Python have a string 'contains' substring method?",
'Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?'
]
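The rule being learned is essentially "elements that look like the example". This is not autoscraper's actual algorithm, but a minimal standard-library sketch of the idea: find the tag and class of the element containing the wanted text, then return every element matching that rule. The sample page and all names below are made up for illustration.

```python
from html.parser import HTMLParser

class RuleLearner(HTMLParser):
    """Record (tag, class, text) for every element that directly contains text."""
    def __init__(self):
        super().__init__()
        self._stack = []    # currently open (tag, class) pairs
        self.records = []   # (tag, class, text) triples

    def handle_starttag(self, tag, attrs):
        self._stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            tag, cls = self._stack[-1]
            self.records.append((tag, cls, text))

def scrape_similar(html, wanted):
    parser = RuleLearner()
    parser.feed(html)
    # "Learn" a rule: the tag + class of the element holding the example text.
    rules = {(tag, cls) for tag, cls, text in parser.records if text == wanted}
    # Return the text of every element that matches a learned rule.
    return [text for tag, cls, text in parser.records if (tag, cls) in rules]

page = """
<ul>
  <li><a class="question" href="/q/1">How to call an external command?</a></li>
  <li><a class="question" href="/q/2">What are metaclasses in Python?</a></li>
  <li><a class="ad" href="/promo">Sponsored link</a></li>
</ul>
"""
result = scrape_similar(page, "How to call an external command?")
print(result)
# → ['How to call an external command?', 'What are metaclasses in Python?']
```

Note how the "ad" link is excluded: it does not match the learned rule, which is how example-driven scraping avoids unrelated page elements.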
Getting exact results
from autoscraper import AutoScraper
url = 'https://finance.yahoo.com/quote/AAPL/'
wanted_list = ["124.81"]
scraper = AutoScraper()
# Here we can also pass html content via the html parameter instead of the url (html=html_content)
result = scraper.build(url, wanted_list)
print(result)
Saving and loading the model
Save:
scraper.save('yahoo-finance')
Load:
scraper.load('yahoo-finance')
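Saving persists the learned rules to a file so the scraper can be restored later without rebuilding it. As a rough sketch of that round-trip idea (this is not autoscraper's actual file format; the rules below are hypothetical):

```python
import json
import os
import tempfile

# Hypothetical learned rules; autoscraper stores its own richer structure.
rules = [{"tag": "a", "class": "question"}]

path = os.path.join(tempfile.gettempdir(), "yahoo-finance.json")

# save(): serialize the rules to disk
with open(path, "w") as f:
    json.dump(rules, f)

# load(): restore them in a fresh session
with open(path) as f:
    loaded = json.load(f)

print(loaded == rules)  # → True
```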
Generating Python code for the scraper
Generate standalone Python code for a specific scraper, so it can be used even in an environment without this library installed:
code = scraper.generate_python_code()
print(code)