Python Web Scraping Notes

2024/7/19 19:36:09 · Tags: python, web scraping

Beautiful Soup 4.4.0 documentation

Reference: Web Scraping with Python (《Python 网络数据采集》)

Getting started with scraping

Installing BeautifulSoup (it is not part of the Python standard library, so it must be installed separately)

On Linux:

sudo apt-get install python-bs4

On macOS:

sudo easy_install pip 

(pip is a package manager.)

pip install beautifulsoup4

If you have both Python 2.x and Python 3.x installed, you need to explicitly run your scraper scripts with python3.

For example: python3 test_urlopen.py
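
The contents of test_urlopen.py are not shown in these notes; a minimal sketch of what such a first test script might look like, using urlopen on one of the example pages that appear later (the file name and URL choice are assumptions):

# test_urlopen.py -- hypothetical minimal first script
from urllib.request import urlopen

# Fetch a page and print its raw HTML
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
print(html.read())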

BeautifulSoup may have been installed under Python 2.x instead of Python 3.x; in that case install it for Python 3:

sudo python3 setup.py install

pip3 install beautifulsoup4

On Windows:

python3 setup.py install 

Open the Python interpreter in a terminal and test:

from bs4 import BeautifulSoup

If no error is raised, the import succeeded.

Otherwise: download the Windows version of pip (http://pypi.python.org/pypi/setuptools) and run:

pip install beautifulsoup4

Error handling:

 UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

Fix: when you create a BeautifulSoup object, pass an HTML parser explicitly, e.g. "html.parser" (bundled with Python) or "lxml" (must be installed separately).
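
A minimal sketch showing the explicit parser choice (the URL is one of the example pages used later in these notes):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
# Naming the parser explicitly silences the UserWarning and keeps the
# behaviour identical across systems
bsObj = BeautifulSoup(html.read(), "html.parser")   # parser bundled with Python
# bsObj = BeautifulSoup(html.read(), "lxml")        # faster, needs: pip3 install lxml
print(bsObj.h1)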

Traceback (most recent call last):
  File "test_BeautifulSoup.py", line 1, in <module>
    from urllib.request import urlopen
ImportError: No module named request

This particular error means the script was run with Python 2 (urllib.request exists only in Python 3), so run it with Python 3:

python3 xxx.py

More generally, an ImportError usually means a library is missing, and running pip3 install xxx fixes it.

 

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")
# Collect every <span class="green"> tag (the character names on the sample page)
nameList = bsObj.findAll("span", {"class": "green"})
for name in nameList:
    print(name.get_text())

Output:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        # The page had no <body> or no <h1>
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found!")
else:
    print(title)

Output:

The difference between find and findAll, and how to use them

findAll(tag, attributes, recursive, text, limit, keywords) — recursive=True (the default) makes the search descend recursively through child tags; limit restricts the result to the first limit matches.

find(tag, attributes, recursive, text, keywords) is equivalent to findAll with limit=1.
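
A minimal sketch of that equivalence, reusing the warandpeace.html page from the first example (the keyword-argument form at the end is standard BeautifulSoup usage, shown for illustration):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")

# find returns the first matching tag (or None); findAll returns a list
first_green = bsObj.find("span", {"class": "green"})
all_green = bsObj.findAll("span", {"class": "green"})
print(first_green == all_green[0])   # True: find is findAll with limit=1

# Keyword arguments filter on attributes directly; "class" needs a trailing
# underscore because it is a reserved word in Python
print(len(bsObj.findAll("span", class_="green")))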

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
# .children iterates only over direct children of the table, not all descendants
for child in bsObj.find("table", {"id": "giftList"}).children:
    print(child)

Output:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
# Match only product images whose src follows the ../img/gifts/img*.jpg pattern
images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])

Output:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://en.wikipedia.org/wiki/Eric_Idle")
bsObj = BeautifulSoup(html, "html.parser")
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urlopen("https://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Only links to other articles: they start with /wiki/ and contain no colon
    return bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))

# Random walk: keep following a randomly chosen article link
links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import datetime
import random

pages = set()
random.seed(datetime.datetime.now())

# All links that start with "/" or contain the current site's domain
def getInternalLinks(bsObj, includeUrl):
    internalLinks = []
    for link in bsObj.findAll("a", href=re.compile("^(/|.*"+includeUrl+")")):
        if link.attrs['href'] not in internalLinks:
            internalLinks.append(link.attrs['href'])
    return internalLinks

# All links that start with http or www and do not contain the current domain
def getExternalLinks(bsObj, excludeUrl):
    externalLinks = []
    for link in bsObj.findAll("a", href=re.compile("^(http|www)((?!"+excludeUrl+").)*$")):
        if link.attrs['href'] not in externalLinks:
            externalLinks.append(link.attrs['href'])
    return externalLinks

def splitAddress(address):
    addressParts = address.replace("https://", "").replace("http://", "").split("/")
    return addressParts

def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bsObj = BeautifulSoup(html, "html.parser")
    externalLinks = getExternalLinks(bsObj, splitAddress(startingPage)[0])
    if len(externalLinks) == 0:
        # No external links on this page: pick an internal link and try again there
        internalLinks = getInternalLinks(bsObj, splitAddress(startingPage)[0])
        return getRandomExternalLink(internalLinks[random.randint(0, len(internalLinks)-1)])
    else:
        return externalLinks[random.randint(0, len(externalLinks)-1)]

def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    print("Random external link: " + externalLink)
    followExternalOnly(externalLink)

followExternalOnly("https://oreilly.com")

Press Ctrl+C to stop (the recursion runs until interrupted).

import os
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup

downloadDirectory = "downloaded"
baseUrl = "http://pythonscraping.com"

def getAbsoluteURL(baseUrl, source):
    if source.startswith("http://www."):
        url = "http://" + source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        url = "http://" + source[4:]
    else:
        url = baseUrl + "/" + source
    if baseUrl not in url:
        return None
    return url

def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory + path
    directory = os.path.dirname(path)
    # Create the target directory if it does not exist yet
    if not os.path.exists(directory):
        os.makedirs(directory)
    return path

html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html, "html.parser")
# Every tag that carries a src attribute (images, scripts, ...)
downloadList = bsObj.findAll(src=True)

for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl, download["src"])
    if fileUrl is not None:
        print(fileUrl)
        urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))

This creates a downloaded directory under the current working directory (e.g. downloaded/img/); the scraped files are saved there.

Connecting Python to a database

1. Install MySQL. Once it is installed:

2. Install the pymysql module.

You can install it directly with the pip package manager:

pip3 install pymysql

Or build it from the source on GitHub:

curl -L https://github.com/PyMySQL/PyMySQL/tarball/pymysql-0.6.2 | tar xz

cd PyMySQL-PyMySQL-f953785 

python3 setup.py install (drop the 3 if you are on Python 2.x)

sudo mysql -uroot -p   

Once inside the MySQL shell:

create database scraping;

use scraping;

create table pages (id BIGINT(7) not null AUTO_INCREMENT, title varchar(200), content varchar(10000), created TIMESTAMP DEFAULT CURRENT_TIMESTAMP, primary key(id));
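
The test script below queries pages for id=1, so the table needs at least one row first. A minimal sketch for seeding it from Python (connection settings copied from the examples that follow; the password is whatever you set for root, and the sample title/content values are placeholders):

import pymysql

conn = pymysql.connect(host='127.0.0.1', unix_socket='/tmp/mysql.sock',
                       user='root', passwd='000000', db='scraping', charset='utf8')
cur = conn.cursor()
# Parameterized INSERT; id and created are filled in automatically
cur.execute("INSERT INTO pages (title, content) VALUES (%s, %s)",
            ("Test page title", "Some test content"))
conn.commit()
cur.close()
conn.close()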

 

import pymysql

# Assumes the pages table in the scraping database already contains a row with id=1
conn = pymysql.connect(host='127.0.0.1', unix_socket='/tmp/mysql.sock', user='root', passwd='000000', db='mysql')
cur = conn.cursor()
cur.execute("USE scraping")
cur.execute("SELECT * FROM pages WHERE id=1")
print(cur.fetchone())
cur.close()
conn.close()

The next script scrapes Wikipedia articles and stores each title and first paragraph in the pages table:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import datetime
import random
import pymysql

conn = pymysql.connect(host='127.0.0.1', unix_socket='/tmp/mysql.sock', user='root', passwd='000000', db='mysql', charset='utf8')
cur = conn.cursor()
cur.execute("USE scraping")
random.seed(datetime.datetime.now())

def store(title, content):
    # Parameterized query: pymysql escapes the values, so no manual quoting is needed
    cur.execute("INSERT INTO pages (title, content) VALUES (%s, %s)", (title, content))
    cur.connection.commit()

def getLinks(articleUrl):
    html = urlopen("https://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    title = bsObj.find("h1").get_text()
    content = bsObj.find("div", {"id": "mw-content-text"}).find("p").get_text()
    store(title, content)
    return bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
try:
    while len(links) > 0:
        newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
        print(newArticle)
        links = getLinks(newArticle)
finally:
    cur.close()
    conn.close()

CREATE DATABASE wikipedia;

use wikipedia;

CREATE TABLE pages( id INT NOT NULL AUTO_INCREMENT, url varchar(255) not NULL, created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY(id));

 CREATE TABLE link( id INT NOT NULL AUTO_INCREMENT, fromPageId INT NULL, toPageId INT NULL, created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY(id));

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import pymysql

conn = pymysql.connect(host='127.0.0.1', unix_socket='/tmp/mysql.sock', user='root', passwd='000000', db='mysql', charset='utf8')
cur = conn.cursor()
cur.execute("USE wikipedia")

def insertPageIfNotExists(url):
    cur.execute("SELECT * FROM pages WHERE url = %s", (url,))
    if cur.rowcount == 0:
        cur.execute("INSERT INTO pages (url) VALUES (%s)", (url,))
        conn.commit()
        return cur.lastrowid
    else:
        return cur.fetchone()[0]

def insertLink(fromPageId, toPageId):
    cur.execute("SELECT * FROM link WHERE fromPageId = %s AND toPageId = %s", (fromPageId, toPageId))
    if cur.rowcount == 0:
        cur.execute("INSERT INTO link (fromPageId, toPageId) VALUES (%s, %s)", (fromPageId, toPageId))
        conn.commit()

pages = set()
def getLinks(pageUrl, recursionLevel):
    global pages
    if recursionLevel > 4:
        return
    pageId = insertPageIfNotExists(pageUrl)
    html = urlopen("https://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)((?!:).)*$")):
        insertLink(pageId, insertPageIfNotExists(link.attrs['href']))
        if link.attrs['href'] not in pages:
            # Only recurse into pages we have not visited yet
            newPage = link.attrs['href']
            pages.add(newPage)
            getLinks(newPage, recursionLevel+1)

getLinks("/wiki/Kevin_Bacon", 0)
cur.close()
conn.close()

Reading PDF files

from urllib.request import urlopen
# process_pdf is provided by the pdfminer3k package (pip3 install pdfminer3k)
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from io import open

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdfFile)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    return content

pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
outputString = readPDF(pdfFile)
print(outputString)
pdfFile.close()

Uploading a file:

import requests

files = {'uploadFile': open('./files/test.png', 'rb')}
r = requests.post("http://pythonscraping.com/pages/processing2.php", files=files)
print(r.text)

Three ways of submitting forms and handling logins:

# python_http.py -- HTTP basic authentication
import requests

from requests.auth import AuthBase
from requests.auth import HTTPBasicAuth

auth = HTTPBasicAuth('ryan', 'password')
r = requests.post(url="http://pythonscraping.com/pages/auth/login.php", auth=auth)
print(r.text)

# python_requests.py -- submit the form, then pass the returned cookies along manually
import requests

params = {'username': 'Ryan', 'password': 'password'}
r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", params)
print("Cookie is set to:")
print(r.cookies.get_dict())
print("--------------")
print("Going to profile page...")
r = requests.get("http://pythonscraping.com/pages/cookies/profile.php", cookies=r.cookies)
print(r.text)

# python_session.py -- let a Session object track cookies automatically
import requests

session = requests.Session()

params = {'username': 'username', 'password': 'password'}
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", params)
print("Cookie is set to:")
print(s.cookies.get_dict())
print("---------")
print("Going to profile page...")
s = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(s.text)

Spoofing HTTP headers

import requests
from bs4 import BeautifulSoup

session = requests.Session()
# Send a browser-like User-Agent and Accept header instead of the default python-requests ones
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome",
           "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"}
url = "https://www.whatismybrowser.com/developers/what-http-headers-is-my-browser-sending"
req = session.get(url, headers=headers)
bsObj = BeautifulSoup(req.text, "html.parser")
print(bsObj.find("table", {"class": "table-striped"}).get_text())

 

