In this blog, I will demonstrate the specific usage of the requests and Beautiful Soup libraries through concrete practical cases.
Example website: the Maoyan Movies TOP 100 chart (猫眼电影 TOP100榜, https://www.maoyan.com/board/4)
Web Crawling Preparation
Before crawling the website, we should ensure that the website's terms of service allow for web crawling, and respect any robots.txt file restrictions to avoid the risk of infringement.
Since a website's robots.txt file is stored in its root directory, we can access it directly by appending '/robots.txt' to the root domain.
Example:
import requests

RobotsFile = 'https://www.maoyan.com/robots.txt'
response = requests.get(RobotsFile)
if response.status_code == 200:
    print(response.text)
else:
    print('Failed to get robots.txt')
Output:
- User-agent: the crawler(s) to which the rule applies ('User-agent: *' means the rule applies to all crawlers)
- Disallow: the paths that crawlers are not allowed to access
- Allow: the paths that crawlers are allowed to access, even if a broader rule would otherwise block them
Based on the above results, we can see that the website restricts crawling of paths containing '?utm_source'. The content we are going to scrape is not within the restricted range, so we can proceed with scraping it.
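If you want to double-check this conclusion programmatically rather than by reading the file by eye, Python's standard urllib.robotparser module can answer the same question. A minimal sketch (it assumes only the robots.txt URL above and the board page URL used later in this post):

from urllib import robotparser

# Parse the site's robots.txt and ask whether a generic crawler ('*')
# is allowed to fetch the TOP 100 board page.
rp = robotparser.RobotFileParser()
rp.set_url('https://www.maoyan.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://www.maoyan.com/board/4'))  # True if crawling this path is allowed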
Crawling Webpages with the requests Library
1. Making Basic Requests with GET
To begin, we directly use the requests.get() function to attempt to crawl the webpage.
Example:
RankLink = requests.get("https://www.maoyan.com/board/4")print(RankLink.text)
Output (part omitted):
Based on the output, we can see that a simple GET request is enough to retrieve the HTML content of the webpage. However, in some cases, calling requests.get() directly returns a 403 status code. This is usually caused by missing request headers, a blocked IP address, or similar issues, with missing request headers being the most common culprit. Many websites check the User-Agent field in the request headers to determine whether the request comes from a browser. If the User-Agent is not set, the server may assume the request is from an automated tool (such as a scraper) and deny access.
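As a quick check before changing anything, we can inspect the status code of the response explicitly, so that a 403 error page is not mistaken for real content. This is just a small sketch using standard requests features on the RankLink response from above:

# 200 means the request succeeded; 403 means the server refused it
print(RankLink.status_code)
# alternatively, raise_for_status() raises requests.HTTPError for any 4xx/5xx response
RankLink.raise_for_status()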
Therefore, I will attempt again by adding request headers.
2. Adding Request Headers
I added request headers by passing the headers parameter to the requests.get() function and then retried the GET request.
Example:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
RankLink = requests.get('https://www.maoyan.com/board/4', headers=headers)
By adding a User-Agent request header in requests.get() to simulate a browser request, we prevent the server from treating the request as coming from an automated tool and rejecting it.
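To see why the server can tell the difference, note that when no headers are supplied, requests sends its own default User-Agent, which openly identifies the client as a Python script. A small sketch to inspect it:

import requests

# The headers requests attaches by default; the User-Agent is of the form
# 'python-requests/<version>', which many sites treat as an automated client.
print(requests.utils.default_headers())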
3. Adjusting the Code Based on the Pagination Method
In the above content, we have successfully made a request to the webpage, but the returned content only includes data from the first page. Therefore, we need to adjust the code based on the webpage's pagination method.
We found that the URL's query string changes in a regular pattern when turning pages (the offset parameter increases by 10 per page). Therefore, we build each page's URL automatically by passing the params parameter to the GET request inside a for loop.
Example:
import numpy as np

# offsets 0, 10, ..., 90 cover the 10 pages of the TOP 100 chart
for i in np.arange(0, 100, 10):
    data = {'offset': i}
    RankLink = requests.get('https://www.maoyan.com/board/4', params=data)
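To confirm that params really appends the expected query string, we can print the final URL that requests built, which is available as the .url attribute of the response (shown here for the last response of the loop above):

# For example, when i == 10 the requested URL is
# https://www.maoyan.com/board/4?offset=10
print(RankLink.url)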
Frequent requests to the Maoyan site from the same IP address within a short period of time will trigger the page's verification mechanism (as shown in the picture). Therefore, we use a proxy pool to avoid triggering it.
Example:
import random

proxies_list = [
    'http://47.122.65.254:8080',
    'http://8.130.34.44:8800',
    'http://47.121.183.107:8443'
]
headers_list = [
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'
]

# offsets 0, 10, ..., 90 cover the 10 pages of the TOP 100 chart
for i in np.arange(0, 100, 10):
    data = {'offset': i}
    # pick a random User-Agent and a random proxy for each request
    headers = {'User-Agent': random.choice(headers_list)}
    p = random.choice(proxies_list)
    # the target URL is https, so the proxy must be registered for both schemes
    proxy = {'http': p, 'https': p}
    RankLink = requests.get('https://www.maoyan.com/board/4', params=data, headers=headers, proxies=proxy)
    print(RankLink.text)
Since each iteration of the for loop produces the HTML of one page, we append each page's HTML to a list so that it can be parsed later with the Beautiful Soup library.
The final code is as follows:
import requests
import numpy as np
import random

proxies_list = [
    'http://47.122.65.254:8080',
    'http://8.130.34.44:8800',
    'http://47.121.183.107:8443'
]
headers_list = [
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'
]

output = []  # one HTML string per page
for i in np.arange(0, 100, 10):
    data = {'offset': i}
    headers = {'User-Agent': random.choice(headers_list)}
    p = random.choice(proxies_list)
    proxy = {'http': p, 'https': p}
    RankLink = requests.get('https://www.maoyan.com/board/4', params=data, headers=headers, proxies=proxy)
    output.append(RankLink.text)
Parsing the Crawled Content with the Beautiful Soup Library
After crawling the webpage with the requests library, we found that the scraped result was cluttered and complex, containing a lot of irrelevant content we didn't need. Therefore, we use the Beautiful Soup library to parse the scraped HTML code.
1. Initializing the BeautifulSoup Object
First, we need to initialize the BeautifulSoup object. Here we take the first web page in the list as an example.
Example:
from bs4 import BeautifulSoup

# output[0] holds the HTML of the first page in the list
soup = BeautifulSoup(output[0], 'lxml')
print(soup.prettify())
Output (part omitted):
2. Using CSS Selector to Get Content
From the neatly indented HTML produced by the .prettify() method, we can see that all of the information for a single movie is contained in a 'dd' node, and each field sits in a descendant node with its own class (board-index, name, star, releasetime, integer, fraction). Therefore, we use CSS selectors via soup.select() to extract these descendant nodes of each dd node.
Example:
Rank = []
Title = []
Star = []
Time = []
Score_integer = []
Score_fraction = []
Score = []

for a, b in enumerate(soup.select(".board-index")):
    Element = BeautifulSoup(str(b), "lxml")
    Rank.append(Element.text)
for a, b in enumerate(soup.select(".name")):
    Element = BeautifulSoup(str(b), "lxml")
    Title.append(Element.text)
for a, b in enumerate(soup.select(".star")):
    Element = BeautifulSoup(str(b), "lxml")
    Star.append(Element.text.strip().replace('主演:', ''))
for a, b in enumerate(soup.select(".releasetime")):
    Element = BeautifulSoup(str(b), "lxml")
    Time.append(Element.text.strip().replace('上映时间:', ''))
for a, b in enumerate(soup.select(".integer")):
    Element = BeautifulSoup(str(b), "lxml")
    Score_integer.append(Element.text)
for a, b in enumerate(soup.select(".fraction")):
    Element = BeautifulSoup(str(b), "lxml")
    Score_fraction.append(Element.text)
# combine the integer and fractional parts of each score
for i in range(0, len(Score_integer)):
    Score.append(f"{Score_integer[i]}{Score_fraction[i]}")

print(Rank)
print(Title)
print(Star)
print(Time)
print(Score)
Output:
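A side note on the parsing code above: soup.select() already returns Tag objects, so re-parsing each match with BeautifulSoup(str(b), 'lxml') is not strictly required. An equivalent, more compact idiom for a single column looks like this (the variable name ranks is only illustrative):

# Read the text of each matched tag directly, without re-parsing it.
ranks = [tag.get_text() for tag in soup.select(".board-index")]
print(ranks)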
Since this only retrieves the data of the first page, we use another for loop to collect the movie data from all 10 pages.
The final code is as follows:
from bs4 import BeautifulSoup

Rank = []
Title = []
Star = []
Time = []
Score_integer = []
Score_fraction = []
Score = []

for page in output:
    soup = BeautifulSoup(page, 'lxml')
    temp_rank = []
    temp_title = []
    temp_star = []
    temp_time = []
    temp_score_integer = []
    temp_score_fraction = []
    for a, b in enumerate(soup.select(".board-index")):
        Element = BeautifulSoup(str(b), "lxml")
        temp_rank.append(Element.text)
    for a, b in enumerate(soup.select(".name")):
        Element = BeautifulSoup(str(b), "lxml")
        temp_title.append(Element.text)
    for a, b in enumerate(soup.select(".star")):
        Element = BeautifulSoup(str(b), "lxml")
        temp_star.append(Element.text.strip().replace('主演:', ''))
    for a, b in enumerate(soup.select(".releasetime")):
        Element = BeautifulSoup(str(b), "lxml")
        temp_time.append(Element.text.strip().replace('上映时间:', ''))
    for a, b in enumerate(soup.select(".integer")):
        Element = BeautifulSoup(str(b), "lxml")
        temp_score_integer.append(Element.text)
    for a, b in enumerate(soup.select(".fraction")):
        Element = BeautifulSoup(str(b), "lxml")
        temp_score_fraction.append(Element.text)
    # combine the integer and fractional parts of each score for this page
    for i in range(len(temp_score_integer)):
        Score.append(f"{temp_score_integer[i]}{temp_score_fraction[i]}")
    Rank.extend(temp_rank)
    Title.extend(temp_title)
    Star.extend(temp_star)
    Time.extend(temp_time)

print(Rank)
print(Title)
print(Star)
print(Time)
print(Score)
Output:
Storing the Crawled Results in MySQL
Above, we obtained all the movie information with Beautiful Soup; next, we store it in a MySQL database using the PyMySQL library.
1. Connecting to MySQL and Creating a Table
First, we use pymysql.connect() to create a MySQL connection object, and then use cursor.execute() with SQL statements to create a new database spiders and a new table Film.
Example:
import pymysql

db = pymysql.connect(
    host='localhost',
    user='root',
    password='123456',
    port=3306
)
cursor = db.cursor()
cursor.execute("CREATE DATABASE IF NOT EXISTS spiders;")
cursor.execute("USE spiders;")
sql = """CREATE TABLE IF NOT EXISTS Film (
    FilmRank VARCHAR(255) NOT NULL,
    FilmTitle VARCHAR(255) NOT NULL,
    FilmStar VARCHAR(255) NOT NULL,
    FilmTime VARCHAR(255) NOT NULL,
    FilmScore VARCHAR(255) NOT NULL,
    PRIMARY KEY (FilmRank)
)"""
cursor.execute(sql)
db.close()
2. Inserting Data into the Database
When inserting data into the database, we construct the SQL statement with placeholders, which avoids manual SQL string concatenation.
Example:
import pymysql

db = pymysql.connect(
    host='localhost',
    user='root',
    password='123456',
    port=3306,
    db='spiders'
)
cursor = db.cursor()
for i in range(len(Rank)):
    FilmRank = Rank[i]
    FilmTitle = Title[i]
    FilmStar = Star[i]
    FilmTime = Time[i]
    FilmScore = Score[i]
    # placeholders let PyMySQL handle quoting and escaping of the values
    sql = 'INSERT INTO Film(FilmRank, FilmTitle, FilmStar, FilmTime, FilmScore) VALUES (%s, %s, %s, %s, %s)'
    try:
        cursor.execute(sql, (FilmRank, FilmTitle, FilmStar, FilmTime, FilmScore))
        db.commit()
    except Exception as e:
        db.rollback()
        print(f"Failed to insert data at index {i}, Error: {e}")
db.close()
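A possible variation that the original code does not use: PyMySQL's cursor.executemany() can insert all rows in one batch, which reduces the number of round trips to the server. A minimal sketch, assuming a connection and cursor opened exactly as above and the same five lists:

# Batch insert: one executemany() call instead of one execute() per row.
rows = list(zip(Rank, Title, Star, Time, Score))
sql = ('INSERT INTO Film(FilmRank, FilmTitle, FilmStar, FilmTime, FilmScore) '
       'VALUES (%s, %s, %s, %s, %s)')
cursor.executemany(sql, rows)
db.commit()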
Finally, we print the contents of the Film table row by row to check whether the data was inserted successfully.
Example:
import pymysql

db = pymysql.connect(
    host='localhost',
    user='root',
    password='123456',
    port=3306,
    db='spiders'
)
cursor = db.cursor()
cursor.execute("SELECT * FROM Film;")
rows = cursor.fetchall()  # fetchall() returns the result rows as tuples
for row in rows:
    print(row)
db.close()
Output (part omitted):