Python Web Crawler, Part 2

May 30, 2015

The simplest crawler program

Reading a CSV file

Let's scrape Yahoo Finance

https://finance.yahoo.com/q/hp?s=%5EGSPC+Historical+Prices

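# coding=UTF-8
import csv

import requests

# r = requests.get("http://example.com")
# # r is the response object and holds everything the server sent back
# print(r)
# print(r.content)

# Download the S&P 500 (^GSPC) historical prices as a CSV file
res = requests.get("http://real-chart.finance.yahoo.com/table.csv?s=%5EGSPC&a=00&b=3&c=1950&d=04&e=24&f=2015&g=d&ignore=.csv")
res.raise_for_status()  # stop here if the server did not return the file

# Write the response body to disk in 1024-byte chunks
with open("yahoo_finance.csv", "wb") as f:
    for chunk in res.iter_content(1024):
        f.write(chunk)

with open("yahoo_finance.csv", "r") as csvfile:
    reader = csv.reader(csvfile)  # reads one row at a time, each row as a list
    for _ in range(10):
        print(next(reader))

with open("yahoo_finance.csv", "r") as csvfile:
    reader = csv.DictReader(csvfile)  # uses the header row as keys, each row as key-value pairs
    for _ in range(10):
        print(next(reader))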
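
Every value that comes out of `csv.DictReader` is a string, so a common next step is converting the fields you care about into numbers. The sketch below assumes the file has `Date` and `Close` columns (an assumption about the downloaded header, not something these notes confirm) and runs on a small in-memory sample with made-up values instead of the real file. `read_closes` is a hypothetical helper name.

```python
import csv
import io

# Made-up sample in the shape the downloaded file is assumed to have.
SAMPLE = """Date,Open,High,Low,Close
2015-05-22,10.0,12.0,9.5,11.0
2015-05-21,9.0,10.5,8.5,10.0
"""


def read_closes(csvfile):
    """Return (date, close) pairs, with the close price converted to float."""
    reader = csv.DictReader(csvfile)  # header row becomes the dict keys
    return [(row["Date"], float(row["Close"])) for row in reader]


closes = read_closes(io.StringIO(SAMPLE))
for date, close in closes:
    print(date, close)
```

To run it on the real download, pass the file object from `open("yahoo_finance.csv", "r")` instead of the `io.StringIO` sample.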

DEMO

Reading JSON

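# coding=UTF-8
# read json
import requests

# Send a custom User-Agent instead of the default one from requests
header = {'User-Agent': 'Not a python crawler'}
res = requests.get("https://www.reddit.com/.json", headers=header)
dic = res.json()  # parse the JSON response body into a dict
print(dic["data"])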
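
Printing `dic["data"]` dumps a large nested structure, so it helps to pull out just one field. The sketch below assumes the response follows Reddit's listing layout, where the posts sit in `data["children"]` and each child carries its own `data` dict with a `title`. The `sample` dict is a hand-written stand-in, not real Reddit output, and `extract_titles` is a hypothetical helper.

```python
# Hand-written stand-in for the parsed response; the real one has many more fields.
sample = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"title": "First post", "score": 10}},
            {"kind": "t3", "data": {"title": "Second post", "score": 5}},
        ]
    },
}


def extract_titles(listing):
    """Collect the title of every post in a listing-shaped dict."""
    return [child["data"]["title"] for child in listing["data"]["children"]]


titles = extract_titles(sample)
print(titles)
```

With the live response, the same call would be `extract_titles(dic)`.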

Using IPython

Install (for Python 2)

sudo pip install ipython

Command to start it

ipython

Command to run the notebook in the browser

ipython notebook

Install (for Python 3)

pip3 install ipython

Command to start it

ipython3

Install the packages needed to use the IPython notebook

sudo pip3 install jinja2 tornado jsonschema pyzmq

Command to run the notebook in the browser

ipython3 notebook

[Lazy way] Install the complete bundle in one go

pip install "ipython[all]"

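Scraping SOP

Step 1

Use the website

Find which page the data is on

Step 1-2: Disable JavaScript and reload to see whether the data is still there

If the data disappears, 99% of the time it is being loaded by XHR or Script requests

Step 1-3: If the data is still there with JS disabled, look at the information in the GET request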
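
The check in Step 1-2 and Step 1-3 can also be done without a browser: `requests` never runs JavaScript, so the text it gets back is what the page looks like with JS disabled. A minimal sketch, assuming you already know one value that shows up on the rendered page (the `marker`). `classify_page` and both HTML strings are made up for illustration.

```python
def classify_page(raw_html, marker):
    """Return "static" if the marker is already in the HTML the server sent,
    otherwise "dynamic" (the data is probably loaded later by XHR or a script)."""
    return "static" if marker in raw_html else "dynamic"


# Made-up pages: one ships the data in the HTML, the other only ships a script.
static_page = "<html><body><td>2015-05-22</td></body></html>"
dynamic_page = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"

print(classify_page(static_page, "2015-05-22"))
print(classify_page(dynamic_page, "2015-05-22"))
```

For a real page, pass `requests.get(url).text` as `raw_html`.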

Step 2

Find which connection actually fetches the data
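
Once the right request has been found, it is often easier to replay it with the query string split into named parameters, so individual values can be changed. The sketch below does that with the standard library, using the Yahoo Finance CSV URL from earlier in these notes as input. `split_request_url` is a hypothetical helper, and it assumes no parameter name is repeated (a plain dict would keep only the last value).

```python
from urllib.parse import parse_qsl, urlsplit, urlunsplit


def split_request_url(url):
    """Split a URL into its base (no query string) and a dict of decoded parameters."""
    parts = urlsplit(url)
    base = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    params = dict(parse_qsl(parts.query))  # percent-escapes such as %5E are decoded
    return base, params


url = ("http://real-chart.finance.yahoo.com/table.csv"
       "?s=%5EGSPC&a=00&b=3&c=1950&d=04&e=24&f=2015&g=d&ignore=.csv")
base, params = split_request_url(url)
print(base)
print(params)
```

The two results can then be handed to `requests.get(base, params=params)`, after editing whichever entries of `params` you want to vary.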

陳雲濤's blog