Overview: This project is designed to automate web scraping and make it easy. It takes the URL or HTML content of a web page, together with a list of sample data we want to scrape from that page. The sample data can be text, a URL, or any HTML tag value on the page. The library learns the scraping rules and returns similar elements. The learned object can then be used with new URLs to fetch similar content, or the exact same elements, from those pages. This means no more manually parsing web pages and writing extraction rules by hand.
GitHub repository:
https://github.com/alirezamika/autoscraper
Installation
This library works with Python 3.
Install from source:
python setup.py install
Install with pip from the git repository:
pip install git+https://github.com/alirezamika/autoscraper.git
Usage
Getting similar results
from autoscraper import AutoScraper
url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'
# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["How to call an external command?"]
scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)
Output:
[
'How do I merge two dictionaries in a single expression in Python (taking union of dictionaries)?',
'How to call an external command?',
'What are metaclasses in Python?',
'Does Python have a ternary conditional operator?',
'How do you remove duplicates from a list whilst preserving order?',
'Convert bytes to a string',
'How to get line count of a large file cheaply in Python?',
"Does Python have a string 'contains' substring method?",
'Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?'
]
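The rule being learned is essentially "elements that look like the example". This is not autoscraper's actual algorithm, but a minimal standard-library sketch of the idea: find the tag and class of the element containing the wanted text, then return every element matching that rule. The sample page and all names below are made up for illustration.

```python
from html.parser import HTMLParser

class RuleLearner(HTMLParser):
    """Record (tag, class, text) for every element that directly contains text."""
    def __init__(self):
        super().__init__()
        self._stack = []    # currently open (tag, class) pairs
        self.records = []   # (tag, class, text) triples

    def handle_starttag(self, tag, attrs):
        self._stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            tag, cls = self._stack[-1]
            self.records.append((tag, cls, text))

def scrape_similar(html, wanted):
    parser = RuleLearner()
    parser.feed(html)
    # "Learn" a rule: the tag + class of the element holding the example text.
    rules = {(tag, cls) for tag, cls, text in parser.records if text == wanted}
    # Return the text of every element that matches a learned rule.
    return [text for tag, cls, text in parser.records if (tag, cls) in rules]

page = """
<ul>
  <li><a class="question" href="/q/1">How to call an external command?</a></li>
  <li><a class="question" href="/q/2">What are metaclasses in Python?</a></li>
  <li><a class="ad" href="/promo">Sponsored link</a></li>
</ul>
"""
result = scrape_similar(page, "How to call an external command?")
print(result)
# → ['How to call an external command?', 'What are metaclasses in Python?']
```

Note how the "ad" link is excluded: it does not match the learned rule, which is how example-driven scraping avoids unrelated page elements.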
Getting exact results
from autoscraper import AutoScraper
url = 'https://finance.yahoo.com/quote/AAPL/'
wanted_list = ["124.81"]
scraper = AutoScraper()
# Here we can also pass html content via the html parameter instead of the url (html=html_content)
result = scraper.build(url, wanted_list)
print(result)
Saving and loading the model
Save:
scraper.save('yahoo-finance')
Load:
scraper.load('yahoo-finance')
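Saving persists the learned rules to a file so the scraper can be restored later without rebuilding it. As a rough sketch of that round-trip idea (this is not autoscraper's actual file format; the rules below are hypothetical):

```python
import json
import os
import tempfile

# Hypothetical learned rules; autoscraper stores its own richer structure.
rules = [{"tag": "a", "class": "question"}]

path = os.path.join(tempfile.gettempdir(), "yahoo-finance.json")

# save(): serialize the rules to disk
with open(path, "w") as f:
    json.dump(rules, f)

# load(): restore them in a fresh session
with open(path) as f:
    loaded = json.load(f)

print(loaded == rules)  # → True
```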
Generating Python code for the scraper
Generate standalone Python code for a specific scraper, so it can be used even in an environment without this library installed:
code = scraper.generate_python_code()
print(code)