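Crawling Gold Price and Dollar Exchange Rate Data¶
- This assumes an EDA that analyzes the closing prices and rates of change of the two series to find out how they are correlated.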
In [4]:
from urllib.request import urlopen
import requests
import bs4
import pandas as pd
Crawling International Gold Price Data¶
In [3]:
# https://finance.naver.com/marketindex/worldDailyQuote.nhn?marketindexCd=CMDT_GC&fdtc=2&page=1
index_cd = "CMDT_GC"
page_n = 1
naver_index = f"https://finance.naver.com/marketindex/worldDailyQuote.nhn?marketindexCd={index_cd}&fdtc=2&page={page_n}"
naver_index
Out[3]:
'https://finance.naver.com/marketindex/worldDailyQuote.nhn?marketindexCd=CMDT_GC&fdtc=2&page=1'
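The same query string can also be assembled with `urllib.parse.urlencode` instead of an f-string, which keeps the parameters in one dictionary. This is a minimal sketch assuming the same endpoint and parameters as the URL above; `build_quote_url` and `BASE_URL` are hypothetical names, not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://finance.naver.com/marketindex/worldDailyQuote.nhn"

def build_quote_url(index_cd, page_n=1):
    """Return the daily-quote URL for one index code and page number."""
    # fdtc=2 is copied from the URL above; its meaning is not documented here.
    params = {"marketindexCd": index_cd, "fdtc": 2, "page": page_n}
    return f"{BASE_URL}?{urlencode(params)}"

print(build_quote_url("CMDT_GC", 1))
```

Because the dictionary keeps insertion order, the parameters come out in the same order as in the hand-written URL.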
- Load the URL and read the response
In [7]:
src = urlopen(naver_index).read()
- Parse it with BeautifulSoup, using lxml as the parser
In [8]:
source = bs4.BeautifulSoup(src, 'lxml')
source
Out[8]:
<html lang="ko">
<head>
<title>네이버 금융</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="text/javascript" http-equiv="Content-Script-Type"/>
<meta content="text/css" http-equiv="Content-Style-Type"/>
<link href="https://ssl.pstatic.net/imgstock/static.pc/20210721200146/css/finance.css" rel="stylesheet" type="text/css"/>
<script language="javascript">document.domain="naver.com";</script>
<script src="https://ssl.pstatic.net/imgstock/static.pc/20210721200146/js/info/jindo.min.ns.1.5.3.euckr.js" type="text/javascript"></script>
<script src="https://ssl.pstatic.net/imgstock/static.pc/20210721200146/js/lcslog.js" type="text/javascript"></script>
</head>
<body>
<div class="section_exchange2">
<h3 class="h_today"><span>일별 시세</span></h3>
<table border="1" class="tbl_exchange today" summary="일별 시세 리스트">
<caption>일별 시세</caption>
<colgroup>
<col width="85"/>
<col width="83"/>
<col width="83"/>
<col width="82"/>
</colgroup>
<thead>
<tr>
<th class="th_today1"><span>날짜</span></th>
<th class="th_today2"><span>파실 때</span></th>
<th class="th_today3"><span>보내실 때 </span></th>
<th class="th_today4"><span>받으실 때</span></th>
</tr>
</thead>
<tbody>
<tr class="up">
<td class="date">
2021.07.29
</td>
<td class="num">
1,831.20
</td>
<td class="num">
<img alt="상승" height="6" src="https://ssl.pstatic.net/static/nfinance/ico_up.gif" width="7"/> 31.50
</td>
<td class="num">
+1.75%
</td>
</tr>
<tr class="down">
<td class="date">
2021.07.28
</td>
<td class="num">
1,799.70
</td>
<td class="num">
<img alt="하락" height="6" src="https://ssl.pstatic.net/static/nfinance/ico_down.gif" width="7"/> 0.10
</td>
<td class="num">
0.00%
</td>
</tr>
<tr class="up">
<td class="date">
2021.07.27
</td>
<td class="num">
1,799.80
</td>
<td class="num">
<img alt="상승" height="6" src="https://ssl.pstatic.net/static/nfinance/ico_up.gif" width="7"/> 0.60
</td>
<td class="num">
+0.03%
</td>
</tr>
<tr class="down">
<td class="date">
2021.07.26
</td>
<td class="num">
1,799.20
</td>
<td class="num">
<img alt="하락" height="6" src="https://ssl.pstatic.net/static/nfinance/ico_down.gif" width="7"/> 2.60
</td>
<td class="num">
-0.14%
</td>
</tr>
<tr class="down">
<td class="date">
2021.07.23
</td>
<td class="num">
1,801.80
</td>
<td class="num">
<img alt="하락" height="6" src="https://ssl.pstatic.net/static/nfinance/ico_down.gif" width="7"/> 3.60
</td>
<td class="num">
-0.19%
</td>
</tr>
<tr class="up">
<td class="date">
2021.07.22
</td>
<td class="num">
1,805.40
</td>
<td class="num">
<img alt="상승" height="6" src="https://ssl.pstatic.net/static/nfinance/ico_up.gif" width="7"/> 2.00
</td>
<td class="num">
+0.11%
</td>
</tr>
<tr class="down">
<td class="date">
2021.07.21
</td>
<td class="num">
1,803.40
</td>
<td class="num">
<img alt="하락" height="6" src="https://ssl.pstatic.net/static/nfinance/ico_down.gif" width="7"/> 8.00
</td>
<td class="num">
-0.44%
</td>
</tr>
</tbody>
</table>
<!-- paging -->
<div class="paging">
<a class="on" href="/marketindex/worldDailyQuote.nhn?marketindexCd=CMDT_GC&fdtc=2&page=1" onclick="parent.clickcr(this,'med.pagination','','',event);">1</a>
<a href="/marketindex/worldDailyQuote.nhn?marketindexCd=CMDT_GC&fdtc=2&page=2" onclick="parent.clickcr(this,'med.pagination','','',event);">2</a>
<a href="/marketindex/worldDailyQuote.nhn?marketindexCd=CMDT_GC&fdtc=2&page=3" onclick="parent.clickcr(this,'med.pagination','','',event);">3</a>
<a href="/marketindex/worldDailyQuote.nhn?marketindexCd=CMDT_GC&fdtc=2&page=4" onclick="parent.clickcr(this,'med.pagination','','',event);">4</a>
<a href="/marketindex/worldDailyQuote.nhn?marketindexCd=CMDT_GC&fdtc=2&page=5" onclick="parent.clickcr(this,'med.pagination','','',event);">5</a>
<a class="next" href="/marketindex/worldDailyQuote.nhn?marketindexCd=CMDT_GC&fdtc=2&page=6" onclick="parent.clickcr(this,'med.pagination','','',event);">다음 <img alt="" border="0" height="5" src="https://ssl.pstatic.net/static/nfinance/bu_pgarR.gif" width="3"/></a>
</div>
</div>
<script type="text/javascript">
</script>
</body>
</html>
- The prettify function is supposed to lay the markup out "prettily", indented by nesting depth, but I don't know enough yet to notice the difference.
In [10]:
print(source.prettify())
<html lang="ko">
<head>
<title>
네이버 금융
</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="text/javascript" http-equiv="Content-Script-Type"/>
<meta content="text/css" http-equiv="Content-Style-Type"/>
<link href="https://ssl.pstatic.net/imgstock/static.pc/20210721200146/css/finance.css" rel="stylesheet" type="text/css"/>
<script language="javascript">
document.domain="naver.com";
</script>
<script src="https://ssl.pstatic.net/imgstock/static.pc/20210721200146/js/info/jindo.min.ns.1.5.3.euckr.js" type="text/javascript">
</script>
<script src="https://ssl.pstatic.net/imgstock/static.pc/20210721200146/js/lcslog.js" type="text/javascript">
</script>
</head>
<body>
<div class="section_exchange2">
<h3 class="h_today">
<span>
일별 시세
</span>
</h3>
<table border="1" class="tbl_exchange today" summary="일별 시세 리스트">
<caption>
일별 시세
</caption>
<colgroup>
<col width="85"/>
<col width="83"/>
<col width="83"/>
<col width="82"/>
</colgroup>
<thead>
<tr>
<th class="th_today1">
<span>
날짜
</span>
</th>
<th class="th_today2">
<span>
파실 때
</span>
</th>
<th class="th_today3">
<span>
보내실 때
</span>
</th>
<th class="th_today4">
<span>
받으실 때
</span>
</th>
</tr>
</thead>
<tbody>
<tr class="up">
<td class="date">
2021.07.29
</td>
<td class="num">
1,831.20
</td>
<td class="num">
<img alt="상승" height="6" src="https://ssl.pstatic.net/static/nfinance/ico_up.gif" width="7"/>
31.50
</td>
<td class="num">
+1.75%
</td>
</tr>
<tr class="down">
<td class="date">
2021.07.28
</td>
<td class="num">
1,799.70
</td>
<td class="num">
<img alt="하락" height="6" src="https://ssl.pstatic.net/static/nfinance/ico_down.gif" width="7"/>
0.10
</td>
<td class="num">
0.00%
</td>
</tr>
<tr class="up">
<td class="date">
2021.07.27
</td>
<td class="num">
1,799.80
</td>
<td class="num">
<img alt="상승" height="6" src="https://ssl.pstatic.net/static/nfinance/ico_up.gif" width="7"/>
0.60
</td>
<td class="num">
+0.03%
</td>
</tr>
<tr class="down">
<td class="date">
2021.07.26
</td>
<td class="num">
1,799.20
</td>
<td class="num">
<img alt="하락" height="6" src="https://ssl.pstatic.net/static/nfinance/ico_down.gif" width="7"/>
2.60
</td>
<td class="num">
-0.14%
</td>
</tr>
<tr class="down">
<td class="date">
2021.07.23
</td>
<td class="num">
1,801.80
</td>
<td class="num">
<img alt="하락" height="6" src="https://ssl.pstatic.net/static/nfinance/ico_down.gif" width="7"/>
3.60
</td>
<td class="num">
-0.19%
</td>
</tr>
<tr class="up">
<td class="date">
2021.07.22
</td>
<td class="num">
1,805.40
</td>
<td class="num">
<img alt="상승" height="6" src="https://ssl.pstatic.net/static/nfinance/ico_up.gif" width="7"/>
2.00
</td>
<td class="num">
+0.11%
</td>
</tr>
<tr class="down">
<td class="date">
2021.07.21
</td>
<td class="num">
1,803.40
</td>
<td class="num">
<img alt="하락" height="6" src="https://ssl.pstatic.net/static/nfinance/ico_down.gif" width="7"/>
8.00
</td>
<td class="num">
-0.44%
</td>
</tr>
</tbody>
</table>
<!-- paging -->
<div class="paging">
<a class="on" href="/marketindex/worldDailyQuote.nhn?marketindexCd=CMDT_GC&fdtc=2&page=1" onclick="parent.clickcr(this,'med.pagination','','',event);">
1
</a>
<a href="/marketindex/worldDailyQuote.nhn?marketindexCd=CMDT_GC&fdtc=2&page=2" onclick="parent.clickcr(this,'med.pagination','','',event);">
2
</a>
<a href="/marketindex/worldDailyQuote.nhn?marketindexCd=CMDT_GC&fdtc=2&page=3" onclick="parent.clickcr(this,'med.pagination','','',event);">
3
</a>
<a href="/marketindex/worldDailyQuote.nhn?marketindexCd=CMDT_GC&fdtc=2&page=4" onclick="parent.clickcr(this,'med.pagination','','',event);">
4
</a>
<a href="/marketindex/worldDailyQuote.nhn?marketindexCd=CMDT_GC&fdtc=2&page=5" onclick="parent.clickcr(this,'med.pagination','','',event);">
5
</a>
<a class="next" href="/marketindex/worldDailyQuote.nhn?marketindexCd=CMDT_GC&fdtc=2&page=6" onclick="parent.clickcr(this,'med.pagination','','',event);">
다음
<img alt="" border="0" height="5" src="https://ssl.pstatic.net/static/nfinance/bu_pgarR.gif" width="3"/>
</a>
</div>
</div>
<script type="text/javascript">
</script>
</body>
</html>
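The effect of prettify is easier to see on a tiny document. The sketch below uses a hand-written snippet (not the real Naver page) and the built-in `html.parser`, so it does not need lxml. `prettify()` puts each tag and string on its own line and indents nested content by one space per level; that indentation is not visible in the output as reproduced above.

```python
import bs4

# A tiny hand-written snippet, not the real Naver page.
html = "<div><p>text</p></div>"
soup = bs4.BeautifulSoup(html, "html.parser")

print(str(soup))        # everything on one line, as written
print(soup.prettify())  # one tag or string per line, indented by depth
```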
- As in the example we saw in class, the data sits inside td tags, so we fetch every td tag.
- Use len to check how many elements the list has.
In [11]:
td = source.find_all('td')
len(td)
Out[11]:
28
Extracting the Dates¶
- The dates are in the td tags whose class is "date".
- Try locating and extracting one via its XPath
- /html/body/div/table/tbody/tr[1]/td[1] = 2021.07.30
In [21]:
# /html/body/div/table/tbody/tr[1]/td[1]
source.find_all('tr')[1].find_all('td')[0]
Out[21]:
<td class="date">
2021.07.29
</td>
- .text returns the data we want, but \n and \t characters are mixed in, so we remove them with .replace()
In [29]:
source.find_all('td', class_="date")[1].text
Out[29]:
'\n\t\t\n\t\t2021.07.28\t\t\t\t\n\t\t'
In [31]:
source.find_all('td', class_="date")[0].text.replace('\t','').replace('\n','')
Out[31]:
'2021.07.29'
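Chaining `.replace()` works, but every whitespace character has to be listed by hand. One alternative sketch: `str.split()` with no argument splits on any run of whitespace, so joining the pieces back together removes tabs and newlines in one step. `clean_cell` is a hypothetical helper name. Beautiful Soup's `get_text(strip=True)` is another option when working directly on a tag.

```python
def clean_cell(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) and trim the ends."""
    return " ".join(text.split())

# The raw string is the .text value shown above.
raw = '\n\t\t\n\t\t2021.07.28\t\t\t\t\n\t\t'
print(repr(clean_cell(raw)))
```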
- Convert the extracted yyyy.mm.dd string into a datetime.date object.
In [33]:
import datetime as dt
In [32]:
date = source.find_all('td', class_="date")[0].text.replace('\t','').replace('\n','')
yyyy, mm, dd = [int(x) for x in date.split('.')]
yyyy, mm, dd
Out[32]:
(2021, 7, 29)
In [54]:
this_date = dt.date(yyyy, mm, dd)
this_date
Out[54]:
datetime.date(2021, 7, 29)
In [55]:
def date_format(date):
    # Split 'yyyy.mm.dd' on dots; int() ignores surrounding whitespace.
    yyyy, mm, dd = [int(x) for x in date.split('.')]
    return dt.date(yyyy, mm, dd)
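The same conversion can be written with `datetime.strptime`, which states the expected format explicitly and raises `ValueError` on anything that does not match. This is a sketch of an alternative, not the notebook's own function; `parse_quote_date` is a hypothetical name.

```python
import datetime as dt

def parse_quote_date(text):
    """Parse a 'yyyy.mm.dd' string, ignoring surrounding whitespace."""
    return dt.datetime.strptime(text.strip(), "%Y.%m.%d").date()

print(parse_quote_date("\n\t\t2021.07.29\t\t\n"))
```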
Extracting the Closing Price¶
In [43]:
source.find_all('td', class_='num')[0].text
Out[43]:
'\n\t\t\t\n\t\t\t\t\n\t\t\t\t1,831.20\n\t\t\t\n\t\t'
- This value is also mixed with \n, \t, and ' ', so we clean it with replace() and strip().
In [44]:
p = source.find_all('td', class_='num')[0].text.replace('\n','').replace('\t','').strip()
p
Out[44]:
'1,831.20'
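The cleaned string still contains a thousands separator, so it cannot be passed to `float()` directly. A small sketch of a conversion helper follows; `parse_number` is a hypothetical name, and it also accepts percent strings such as the `+1.75%` cells in the table.

```python
def parse_number(text):
    """Convert a scraped numeric cell such as '1,831.20' or '+1.75%' to float."""
    cleaned = text.strip().replace(",", "").replace("%", "")
    return float(cleaned)

print(parse_number("\n\t\t\t\t1,831.20\n\t\t"))
print(parse_number("+1.75%"))
```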
- Looking at source, the num class holds other data besides the closing price.
In [45]:
source.find_all('td', class_='num')[1].text.replace('\n','').replace('\t','').strip()
Out[45]:
'31.50'
In [46]:
source.find_all('td', class_='num')[2].text.replace('\n','').replace('\t','').strip()
Out[46]:
'+1.75%'
In [47]:
source.find_all('td', class_='num')[3].text.replace('\n','').replace('\t','').strip()
Out[47]:
'1,799.70'
In [48]:
source.find_all('td', class_='num')[6].text.replace('\n','').replace('\t','').strip()
Out[48]:
'1,799.80'
- Checking them one at a time shows that the closing prices appear at indices 0, 3, 6, ...
Fetching the Rate of Change¶
In [87]:
source.find_all('td', class_='num')[11].text.replace('\n','').replace('\t','').replace('%','').strip()
Out[87]:
'-0.33'
- The rates of change appear at indices 2, 5, 8, 11, ...
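Both index patterns (0, 3, 6, ... and 2, 5, 8, ...) are "every third element from a given start", which Python's extended slicing expresses directly. The sketch below uses a plain list of cell texts taken from the first two rows of the page shown earlier, rather than live tags.

```python
# Cell texts in page order: close, change, change rate, repeated per row.
# The values are the first two rows of the page shown above.
cells = ["1,831.20", "31.50", "+1.75%", "1,799.70", "0.10", "0.00%"]

closes = cells[0::3]   # indices 0, 3, 6, ...
ratios = cells[2::3]   # indices 2, 5, 8, ...

print(closes)
print(ratios)
```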
In [94]:
prices[i*3-1].text
Out[94]:
'\n\t\t +0.22%\n\t\t'
- Now use these patterns to fetch the date, closing price, and rate of change together.
In [57]:
dates = source.find_all('td', class_='date')
len(dates)
Out[57]:
7
In [59]:
prices = source.find_all('td', class_='num')
len(prices)
Out[59]:
21
In [102]:
for i in range(len(dates)):
    this_date = dates[i].text
    this_date = date_format(this_date)
    this_close = prices[i*3].text.replace('\n','').replace('\t','').replace(',','')
    this_close = float(this_close)
    this_ratio = prices[i*3+2].text.replace('\n','').replace('\t','').replace('%','')
    this_ratio = float(this_ratio)
    print(this_date, this_close, this_ratio)
2021-07-29 1831.2 1.75
2021-07-28 1799.7 0.0
2021-07-27 1799.8 0.03
2021-07-26 1799.2 -0.14
2021-07-23 1801.8 -0.19
2021-07-22 1805.4 0.11
2021-07-21 1803.4 -0.44
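The index arithmetic above depends on every row having exactly three num cells. An alternative sketch walks the table row by row, so each date stays paired with the cells of its own `tr`. The HTML below is a hand-written sample that mimics the layout shown earlier, parsed with the built-in `html.parser`; it is not fetched from the site.

```python
import bs4

# Hand-written sample that mimics the table layout shown earlier.
sample = """
<table><tbody>
<tr class="up"><td class="date">2021.07.29</td><td class="num">1,831.20</td>
<td class="num">31.50</td><td class="num">+1.75%</td></tr>
<tr class="down"><td class="date">2021.07.28</td><td class="num">1,799.70</td>
<td class="num">0.10</td><td class="num">0.00%</td></tr>
</tbody></table>
"""
soup = bs4.BeautifulSoup(sample, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    date_cell = tr.find("td", class_="date")
    if date_cell is None:      # skip header rows, which have no date cell
        continue
    nums = [td.get_text(strip=True) for td in tr.find_all("td", class_="num")]
    close = float(nums[0].replace(",", ""))   # first num cell: closing price
    ratio = float(nums[2].replace("%", ""))   # third num cell: rate of change
    rows.append((date_cell.get_text(strip=True), close, ratio))

print(rows)
```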
Crawling the Data on 100 Pages¶
In [101]:
index_cd = "CMDT_GC"
page_n = 1 # page number
naver_index = f"https://finance.naver.com/marketindex/worldDailyQuote.nhn?marketindexCd={index_cd}&fdtc=2&page={page_n}"
src = urlopen(naver_index).read()
source = bs4.BeautifulSoup(src, 'lxml')
td = source.find_all('td')
dates = source.find_all('td', class_='date')
prices = source.find_all('td', class_='num')
for i in range(len(dates)):
    # '\n' (newline), not '\m', is the character to strip here
    this_date = dates[i].text.replace('\n', '').replace('\t', '').strip()
    this_date = date_format(this_date)
    this_close = prices[i*3].text.replace('\n','').replace('\t','').replace(',','')
    this_close = float(this_close)
    this_ratio = prices[i*3+2].text.replace('\n','').replace('\t','').replace('%','')
    this_ratio = float(this_ratio)
    print(this_date, this_close, this_ratio)
2021-07-29 1831.2 1.75
2021-07-28 1799.7 0.0
2021-07-27 1799.8 0.03
2021-07-26 1799.2 -0.14
2021-07-23 1801.8 -0.19
2021-07-22 1805.4 0.11
2021-07-21 1803.4 -0.44
Bundling the Steps Above into One Function¶
In [112]:
def crawl_gold_index(index_cd, end_page):
    """Crawl pages 1..end_page for index_cd and return the rows as a DataFrame."""
    date_list = []
    price_list = []
    ratio_list = []
    for page_n in range(1, end_page+1):
        naver_index = f"https://finance.naver.com/marketindex/worldDailyQuote.nhn?marketindexCd={index_cd}&fdtc=2&page={page_n}"
        src = urlopen(naver_index).read()
        source = bs4.BeautifulSoup(src, 'lxml')
        td = source.find_all('td')
        dates = source.find_all('td', class_='date')
        prices = source.find_all('td', class_='num')
        for i in range(len(dates)):
            this_date = dates[i].text.replace('\n', '').replace('\t', '').strip()
            this_date = date_format(this_date)
            # each row has three num cells: close (i*3), change, rate of change (i*3+2)
            this_close = prices[i*3].text.replace('\n','').replace('\t','').replace(',','')
            this_close = float(this_close)
            this_ratio = prices[i*3+2].text.replace('\n','').replace('\t','').replace('%','')
            this_ratio = float(this_ratio)
            date_list.append(this_date)
            price_list.append(this_close)
            ratio_list.append(this_ratio)
    # column names: 날짜 = date, 체결가 = price, 등락률 = rate of change (%)
    df = pd.DataFrame({'날짜' : date_list, "체결가" : price_list, "등락률" : ratio_list})
    return df
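When looping over many pages it can help to pause between requests and to stop once a page comes back empty. The sketch below separates that loop from the fetching so it can run offline. `crawl_pages` and `fetch_rows` are hypothetical names, and the idea that a page past the end yields no rows is an assumption that has not been verified against the site.

```python
import time

def crawl_pages(fetch_rows, end_page, delay=0.0):
    """Collect rows from pages 1..end_page, stopping early at an empty page.

    fetch_rows is a hypothetical callable: page number -> list of row tuples.
    """
    all_rows = []
    for page_n in range(1, end_page + 1):
        rows = fetch_rows(page_n)
        if not rows:            # no data on this page: assume we ran past the end
            break
        all_rows.extend(rows)
        if delay:
            time.sleep(delay)   # pause between requests to go easy on the server
    return all_rows

# Stand-in fetcher so the sketch runs offline; only pages 1 and 2 have data.
fake_pages = {1: [("2021-07-29", 1831.2, 1.75)], 2: [("2021-07-28", 1799.7, 0.0)]}
result = crawl_pages(lambda n: fake_pages.get(n, []), end_page=5)
print(len(result))
```

In real use, `fetch_rows` would wrap the download-and-parse step for one page, and `delay` would be set to a small positive number of seconds.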
In [113]:
crawl_gold_index("CMDT_GC", 20)
Out[113]:
| | 날짜 | 체결가 | 등락률 |
|---|---|---|---|
| 0 | 2021-07-29 | 1831.2 | 1.75 |
| 1 | 2021-07-28 | 1799.7 | 0.00 |
| 2 | 2021-07-27 | 1799.8 | 0.03 |
| 3 | 2021-07-26 | 1799.2 | -0.14 |
| 4 | 2021-07-23 | 1801.8 | -0.19 |
| ... | ... | ... | ... |
| 135 | 2021-01-14 | 1850.3 | -0.17 |
| 136 | 2021-01-13 | 1853.6 | 0.58 |
| 137 | 2021-01-12 | 1842.9 | -0.36 |
| 138 | 2021-01-11 | 1849.6 | 0.84 |
| 139 | 2021-01-08 | 1834.1 | -4.08 |
140 rows × 3 columns
WTI has a similar structure, so let's try fetching it as an experiment.
In [114]:
crawl_gold_index("OIL_CL", 20)
Out[114]:
| | 날짜 | 체결가 | 등락률 |
|---|---|---|---|
| 0 | 2021-07-29 | 73.62 | 1.69 |
| 1 | 2021-07-28 | 72.39 | 1.03 |
| 2 | 2021-07-27 | 71.65 | -0.36 |
| 3 | 2021-07-26 | 71.91 | -0.22 |
| 4 | 2021-07-23 | 72.07 | 0.22 |
| ... | ... | ... | ... |
| 135 | 2021-01-14 | 53.57 | 1.24 |
| 136 | 2021-01-13 | 52.91 | -0.56 |
| 137 | 2021-01-12 | 53.21 | 1.83 |
| 138 | 2021-01-11 | 52.25 | 0.01 |
| 139 | 2021-01-08 | 52.24 | 2.77 |
140 rows × 3 columns
The commodity pages all seem to share much the same structure.
How about copper, too?
In [116]:
crawl_gold_index("CMDT_CDY", 20)
Out[116]:
| | 날짜 | 체결가 | 등락률 |
|---|---|---|---|
| 0 | 2021-07-29 | 9781.0 | 0.87 |
| 1 | 2021-07-28 | 9697.0 | -0.12 |
| 2 | 2021-07-27 | 9709.0 | 1.35 |
| 3 | 2021-07-26 | 9580.0 | 1.55 |
| 4 | 2021-07-23 | 9433.5 | 0.54 |
| ... | ... | ... | ... |
| 135 | 2021-01-14 | 8002.5 | 0.53 |
| 136 | 2021-01-13 | 7960.5 | -0.28 |
| 137 | 2021-01-12 | 7983.0 | 0.40 |
| 138 | 2021-01-11 | 7951.5 | -2.39 |
| 139 | 2021-01-08 | 8146.0 | 1.36 |
140 rows × 3 columns
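Once two series have been crawled, the correlation step of the EDA could look roughly like the sketch below. It uses WTI as a stand-in for the second series, since the dollar exchange rate is not crawled in this post, and it hard-codes the first five rows of the gold and WTI tables above rather than calling the crawler. Five points are far too few for a meaningful result, so this only shows the mechanics.

```python
import pandas as pd

# First five rows of the gold and WTI tables above (등락률 column).
dates = ["2021-07-29", "2021-07-28", "2021-07-27", "2021-07-26", "2021-07-23"]
gold = pd.DataFrame({"날짜": dates, "등락률": [1.75, 0.00, 0.03, -0.14, -0.19]})
wti = pd.DataFrame({"날짜": dates, "등락률": [1.69, 1.03, -0.36, -0.22, 0.22]})

# Align the two series on the date column; overlapping columns get suffixes.
merged = gold.merge(wti, on="날짜", suffixes=("_gold", "_wti"))

# Pearson correlation of the two daily rates of change.
corr = merged["등락률_gold"].corr(merged["등락률_wti"])
print(corr)
```

Merging on the date column keeps only the days present in both series, which matters when the two markets have different holidays.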