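I've just started learning web scraping. I found an example online that scrapes the Maoyan Top 100 and followed along, but the result I get is '[]'. When I looked at the page that was returned, it is not the Top 100 source code, and it mentions verification.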
- import requests
- from requests.exceptions import RequestException
- import re
-
- # Browser-like User-Agent sent with every request
- headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}
-
- def get_one_page(url):
-     # Return the page HTML, or None if the request fails or the status is not 200
-     try:
-         response = requests.get(url, headers=headers)
-         if response.status_code == 200:
-             return response.text
-         return None
-     except RequestException:
-         return None
-
- def parse_one_page(html):
-     # Raw strings keep \d as a regex escape instead of a string escape
-     pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
-                          r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
-                          r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
-     items = re.findall(pattern, html)
-     print(items)
-
- def main():
-     url = "https://maoyan.com/board/4?"
-     html = get_one_page(url)
-     # get_one_page returns None on failure; re.findall would raise a TypeError on it
-     if html is None:
-         print('request failed')
-         return
-     parse_one_page(html)
-
- if __name__ == '__main__':
-     main()
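One way to confirm the symptom described in the post (a page comes back, but it is not the Top 100 listing) is to check the HTML for the markers the regex depends on before parsing it. The following is only a minimal sketch: `looks_like_board_page`, `page_title`, and `describe_page` are hypothetical helper names that are not part of the original example, and the `<dd>` and `board-index` markers are taken from the regex in `parse_one_page`. What the verification page actually contains is not known here, so the title lookup is an assumption about where a hint might appear.

```python
import re

def looks_like_board_page(html):
    """Heuristic: the regex in parse_one_page needs both of these markers."""
    return '<dd>' in html and 'board-index' in html

def page_title(html):
    """Return the text of the <title> tag, or None if the page has none."""
    match = re.search(r'<title>(.*?)</title>', html, re.S)
    return match.group(1).strip() if match else None

def describe_page(html):
    """Say whether the HTML looks like the listing or like some other page."""
    if looks_like_board_page(html):
        return 'board page'
    return 'unexpected page, title: {!r}'.format(page_title(html))
```

Printing `describe_page(html)` in `main()` before calling `parse_one_page(html)` would show whether the empty list comes from the regex itself or from the site returning a different page.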