Crawling a web page with BeautifulSoup¶
In [3]:
from bs4 import BeautifulSoup
import requests
Fetching HTML data from a web page¶
- Request the page with requests.get()
- Turn the response text into a soup object
In [9]:
url = "https://sports.news.naver.com/news?oid=108&aid=0002976247"
resp = requests.get(url)
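# resp.text holds the page's HTML source as a string
soup = BeautifulSoup(resp.text, 'html.parser')  # name the parser explicitly to avoid the parser-guessing warning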
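The fetch-and-parse step can also be wrapped in small helpers. The following is a minimal sketch rather than the author's code: the names `fetch_html` and `make_soup`, the 10-second timeout, and the choice of `html.parser` are assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # raise an exception on 4xx/5xx responses
    return resp.text


def make_soup(html):
    """Parse HTML with an explicitly named parser, so the result does not
    depend on which optional parsers happen to be installed."""
    return BeautifulSoup(html, "html.parser")


# Offline check with a tiny hand-written document (not the real page).
sample = "<html><body><h4 class='title'>Hello</h4></body></html>"
soup_demo = make_soup(sample)
print(soup_demo.find("h4", class_="title").get_text())
```

Calling `make_soup(fetch_html(url))` would reproduce the cell above, with a failure raised early if the request does not succeed.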
Getting the article title text¶
In [11]:
title = soup.find('h4', class_='title')
title.get_text()
Out[11]:
"'방출 통보' 토트넘 베테랑, 결국 1년 만에 떠난다"
Getting the reporter's name¶
In [15]:
name = soup.find('div', class_='name')
name.get_text()
Out[15]:
'김명석 기자'
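`find()` returns `None` when nothing matches, so calling `get_text()` directly fails as soon as the page layout changes. A minimal sketch of a guard follows; the helper name `text_of` and the sample HTML are hypothetical and only mimic the class names used above.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the class names used above.
html = """
<h4 class="title">Sample headline</h4>
<div class="name">Sample Reporter</div>
"""
soup_demo = BeautifulSoup(html, "html.parser")


def text_of(soup, tag, class_name):
    """Return the stripped text of the first match, or None if absent."""
    node = soup.find(tag, class_=class_name)
    return node.get_text(strip=True) if node is not None else None


print(text_of(soup_demo, "h4", "title"))    # Sample headline
print(text_of(soup_demo, "div", "name"))    # Sample Reporter
print(text_of(soup_demo, "div", "byline"))  # None
```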
In [31]:
div = soup.find_all('div')
Extracting the article's publication date¶
In [22]:
info = soup.find('div', class_='info')
info
Out[22]:
<div class="info"> <span>기사입력 2021.07.29. 오전 05:05</span> <span><span class="bar"></span>최종수정 2021.07.29. 오전 05:05</span> <a class="press_link" href="http://star.mt.co.kr/stview.php?no=2021072822520527271" target="_blank">기사원문</a> </div>
In [24]:
info.find('span')
Out[24]:
<span>기사입력 2021.07.29. 오전 05:05</span>
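The span still contains the label and a Korean AM/PM marker. One possible way to turn it into a `datetime` is sketched below, reusing the spans from Out[22]. The regular expression and the function name `parse_korean_datetime` are assumptions based on the single date format visible in that output.

```python
import re
from datetime import datetime
from bs4 import BeautifulSoup

# The <div class="info"> markup shown in Out[22], shortened to its spans.
html = (
    '<div class="info"> <span>기사입력 2021.07.29. 오전 05:05</span> '
    '<span><span class="bar"></span>최종수정 2021.07.29. 오전 05:05</span> </div>'
)
info_demo = BeautifulSoup(html, "html.parser").find("div", class_="info")

# Assumed pattern: YYYY.MM.DD. followed by 오전 (AM) or 오후 (PM) and HH:MM.
DATE_RE = re.compile(r"(\d{4})\.(\d{2})\.(\d{2})\.\s*(오전|오후)\s*(\d{1,2}):(\d{2})")


def parse_korean_datetime(text):
    """Convert e.g. '2021.07.29. 오전 05:05' to a datetime, or None."""
    m = DATE_RE.search(text)
    if m is None:
        return None
    year, month, day, ampm, hour, minute = m.groups()
    # 12-hour to 24-hour: 12 AM becomes 0, 12 PM stays 12.
    hour24 = int(hour) % 12 + (12 if ampm == "오후" else 0)
    return datetime(int(year), int(month), int(day), hour24, int(minute))


published = parse_korean_datetime(info_demo.find("span").get_text())
print(published)  # 2021-07-29 05:05:00
```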
Extracting the article body¶
In [38]:
container = soup.find('div', id='newsEndContents')
contents = ''
for p in container.find_all('p'):
    contents += p.get_text().strip()
contents
Out[38]:
'기사제공 스타뉴스김명석 기자 (clear@mtstarnews.com)김명석 기자의 구독을 취소하시겠습니까?구독에서 해당 기자의 기사가 제외됩니다.스타뉴스 김명석 기자입니다.Copyright ⓒ 스타뉴스. All rights reserved. 무단 전재 및 재배포 금지.스포츠 기사 섹션(종목) 정보는 언론사 분류와 기술 기반의 자동 분류 시스템을 따르고 있습니다. 오분류에 대한 건은 네이버스포츠로 제보 부탁드립니다.'
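The loop concatenates every `<p>` inside the container with no separator, which is why the sentences in the output run together. A minimal sketch of a variant that keeps paragraph boundaries follows. The sample HTML is hand-written and only borrows the `newsEndContents` id; it is not the real page.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the container (not the real article markup).
html = """
<div id="newsEndContents">
  <p> First paragraph. </p>
  <p>Second paragraph.</p>
  <p>   </p>
</div>
"""
container_demo = BeautifulSoup(html, "html.parser").find("div", id="newsEndContents")

# Strip each paragraph, drop the empty ones, and join with newlines.
paragraphs = [p.get_text(strip=True) for p in container_demo.find_all("p")]
contents_demo = "\n".join(t for t in paragraphs if t)
print(contents_demo)
```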
Extracting values with CSS¶
In [58]:
url = "https://sports.news.naver.com/news?oid=108&aid=0002976247"
resp = requests.get(url)
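# resp.text holds the page's HTML source as a string
soup = BeautifulSoup(resp.text, 'html.parser')  # name the parser explicitly to avoid the parser-guessing warning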
- Fetch the online article from the web page.
Using CSS selectors¶
- Find descendant tags ('tag tag')
- Find child tags ('tag > tag')
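The difference between the two combinators is easiest to see on a small document. The sketch below uses hand-written HTML (the `box` id and the link targets are made up) and compares what each selector returns.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link directly under #box, one nested inside <p>.
html = """
<div id="box">
  <a href="/direct">direct child</a>
  <p><a href="/nested">nested descendant</a></p>
</div>
"""
soup_demo = BeautifulSoup(html, "html.parser")

descendants = soup_demo.select("#box a")   # matches at any depth below #box
children = soup_demo.select("#box > a")    # matches immediate children only

print([a["href"] for a in descendants])  # ['/direct', '/nested']
print([a["href"] for a in children])     # ['/direct']
```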
In [66]:
soup.select('#newsEndContents a')
Out[66]:
[<a class="link_thumbnail" href="https://media.naver.com/journalist/108/75595" onclick="clickcr(this, 'art.more', '', '', event);"> <div class="thumbnail"> <!-- [D] 이미지 원본 사이즈 : 40x40 --> <img alt="" height="40" onerror="$(this).parent().hide();" src="https://mimgnews.pstatic.net/image/upload/journalist/2021/03/08/KMS.jpg" width="40"/> </div> </a>, <a class="link_press" href="https://media.naver.com/journalist/108/75595" onclick="clickcr(this, 'art.more', '', '', event);"> <div class="press"><img alt="스타뉴스" class="press_img" height="20" onerror="$(this).parent().hide();" src=" https://mimgnews.pstatic.net/image/upload/office_logo/108/2017/01/11/logo_108_6_20170111151211.jpg " title=""/></div> <div class="name">김명석 기자</div> </a>, <a class="link_morenews" href="/news.nhn?oid=108&aid=0002975057" onclick="clickcr(this, 'art.bestart', '', '', event);">토트넘 베테랑 수비수, 카타르로 떠난다... SON→남태희 동료</a>, <a class="link_morenews" href="/news.nhn?oid=108&aid=0002976379" onclick="clickcr(this, 'art.bestart', '', '', event);">5세트 블로킹 포효→서브 에이스, 김연경이 끝냈다 [도쿄올림픽]</a>, <a aria-describedby="wa_categorize_tooltip" class="btn_guide_categorize" href="#wa_categorize_tooltip" role="button">기사 섹션 분류 가이드</a>, <a class="btn_report" href="https://help.naver.com/support/contents/contents.nhn?serviceNo=1001&categoryNo=21210" target="_blank" title="새창">오분류 제보하기</a>, <a class="link_promotion" href="https://star.mt.co.kr/redirectAd.php?id=0&date=20210624031938">스타뉴스 핫이슈</a>, <a class="link_promotion" href="https://star.mt.co.kr/redirectAd.php?id=1&date=20210624031938">생생 스타 현장</a>]
Finding the title with a selector¶
In [68]:
soup.select('h4')
Out[68]:
[<h4 class="title">'방출 통보' 토트넘 베테랑, 결국 1년 만에 떠난다</h4>]
In [73]:
soup.select('h4.title')
Out[73]:
[<h4 class="title">'방출 통보' 토트넘 베테랑, 결국 1년 만에 떠난다</h4>]
In [78]:
soup.select('h4[class="title"]')
Out[78]:
[<h4 class="title">'방출 통보' 토트넘 베테랑, 결국 1년 만에 떠난다</h4>]
In [80]:
soup.select('h4[class^="t"]')
Out[80]:
[<h4 class="title">'방출 통보' 토트넘 베테랑, 결국 1년 만에 떠난다</h4>]
In [81]:
soup.select('h4[class$="tle"]')
Out[81]:
[<h4 class="title">'방출 통보' 토트넘 베테랑, 결국 1년 만에 떠난다</h4>]
In [82]:
soup.select('h4[class*="l"]')
Out[82]:
[<h4 class="title">'방출 통보' 토트넘 베테랑, 결국 1년 만에 떠난다</h4>]
- Use nth-child to find elements at a given position
In [93]:
soup.select('span.blind:nth-child(1)')
Out[93]:
[<span class="blind">네이버</span>, <span class="blind">스포츠</span>, <span class="blind">뉴스</span>, <span class="blind">날씨</span>, <span class="blind">TV연예</span>, <span class="blind">TOKYO 2020</span>, <span class="blind">검색</span>, <span class="blind">본문 텍스트 한단계 확대</span>, <span class="blind">본문 텍스트 한단계 축소</span>, <span class="blind">본문 프린트</span>, <span class="blind">닫기</span>, <span class="blind">가이드 닫기</span>, <span class="blind">도움말</span>, <span class="blind">도움말</span>, <span class="blind">도움말 닫기</span>, <span class="blind">재생시간</span>, <span class="blind">재생시간</span>, <span class="blind">재생시간</span>, <span class="blind">재생시간</span>]
In [94]:
soup.select('span.blind')[1]
Out[94]:
<span class="blind">스포츠</span>
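The two cells above answer different questions. `:nth-child(1)` keeps every `span.blind` that is the first child of its own parent, which is why many elements come back, while `[1]` picks the second item of the full result list. A small sketch on hand-written HTML illustrates this.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the third span is the second child of its <li>.
html = """
<ul>
  <li><span class="blind">A</span></li>
  <li><span class="blind">B</span></li>
  <li><em>x</em><span class="blind">C</span></li>
</ul>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# Spans that are the first element child of their parent.
first_children = soup_demo.select("span.blind:nth-child(1)")
print([s.get_text() for s in first_children])  # ['A', 'B']

# Second span.blind in document order, regardless of its parent.
second_match = soup_demo.select("span.blind")[1]
print(second_match.get_text())  # B
```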
In [95]:
import re
In [96]:
soup.find_all('h3')
Out[96]:
[<h3><span class="logo">스타뉴스</span> 주요뉴스<span>해당 언론사에서 선정하며 <em>언론사 페이지(아웃링크)</em>로 이동해 볼 수 있습니다.</span></h3>, <h3 class="title"><span>이 기사를 본</span> 사람들이 본 뉴스</h3>, <h3 class="title">이 시각 많이 본 뉴스</h3>, <h3 class="title">많이 본 영상</h3>, <h3 class="blind">올림픽채널</h3>, <h3 class="title">리그별 득점 순위</h3>, <h3 class="title">PHOTO</h3>, <h3 class="title">PHOTO</h3>, <h3 class="title">PHOTO</h3>, <h3 class="title">PHOTO</h3>]
- Extract tags whose name is h followed by a digit (h1, h2, ...)
In [99]:
soup.find_all(re.compile(r'h\d'))
Out[99]:
[<h1 class="logo_group"> <a class="logo_naver" href="https://www.naver.com" onclick="clickcr(this, 'STA.naverlogo', '', '', event);"><span class="blind">네이버</span></a> <a class="logo_sports" href="https://sports.news.naver.com/" onclick="clickcr(this, 'STA.sports', '', '', event);"><span class="blind">스포츠</span></a> </h1>, <h2 class="blind">메인 메뉴</h2>, <h4 class="title">'방출 통보' 토트넘 베테랑, 결국 1년 만에 떠난다</h4>, <h3><span class="logo">스타뉴스</span> 주요뉴스<span>해당 언론사에서 선정하며 <em>언론사 페이지(아웃링크)</em>로 이동해 볼 수 있습니다.</span></h3>, <h3 class="title"><span>이 기사를 본</span> 사람들이 본 뉴스</h3>, <h3 class="title">이 시각 많이 본 뉴스</h3>, <h3 class="title">많이 본 영상</h3>, <h3 class="blind">올림픽채널</h3>, <h3 class="title">리그별 득점 순위</h3>, <h3 class="title">PHOTO</h3>, <h3 class="title">PHOTO</h3>, <h3 class="title">PHOTO</h3>, <h3 class="title">PHOTO</h3>]
- Fetch only jpg images
In [100]:
soup.find_all('img')
Out[100]:
[<img alt="스타뉴스" height="35" onerror="javascript:setPressLogo('스타뉴스');" src="https://mimgnews.pstatic.net/image/upload/office_logo/108/2017/01/05/logo_108_11_20170105110805.gif"/>, <img alt="" src="https://imgnews.pstatic.net/image/108/2021/07/29/0002976247_001_20210729050508072.jpg?type=w647"/>, <img alt="" height="40" onerror="$(this).parent().hide();" src="https://mimgnews.pstatic.net/image/upload/journalist/2021/03/08/KMS.jpg" width="40"/>, <img alt="스타뉴스" class="press_img" height="20" onerror="$(this).parent().hide();" src=" https://mimgnews.pstatic.net/image/upload/office_logo/108/2017/01/11/logo_108_6_20170111151211.jpg " title=""/>, <img alt="" height="54" src="https://dthumb-phinf.pstatic.net/?type=sports_nf60_54&sharpen=true&src=https://imgnews.pstatic.net/image/thumb154/477/2021/07/29/311782.jpg" width="60"/>, <img alt="" height="54" src="https://dthumb-phinf.pstatic.net/?type=sports_nf60_54&sharpen=true&src=https://imgnews.pstatic.net/image/thumb154/139/2021/07/29/2154211.jpg" width="60"/>, <img alt="" height="54" src="https://dthumb-phinf.pstatic.net/?type=sports_nf60_54&sharpen=true&src=https://imgnews.pstatic.net/image/thumb154/216/2021/07/29/115463.jpg" width="60"/>, <img alt="" height="54" src="https://dthumb-phinf.pstatic.net/?type=sports_nf60_54&sharpen=true&src=https://imgnews.pstatic.net/image/thumb154/343/2021/07/29/106828.jpg" width="60"/>, <img alt="" class="imageLazyLoad" height="156" lazy-src="https://dthumb-phinf.pstatic.net/?twidth=260&theight=156&opts=12&qlt=95&src=https://phinf.pstatic.net/tvcast/20210729_181/GPm3D_16275138324901zOXi_JPEG/1627513828801.jpg" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="260"/>, <img alt="" class="imageLazyLoad" height="64" lazy-src="https://dthumb-phinf.pstatic.net/?twidth=110&theight=64&opts=12&qlt=95&src=https://phinf.pstatic.net/tvcast/20210729_234/bILic_1627515183867TuW7y_JPEG/1627515178301.jpg" 
src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="110"/>, <img alt="" class="imageLazyLoad" height="64" lazy-src="https://dthumb-phinf.pstatic.net/?twidth=110&theight=64&opts=12&qlt=95&src=https://phinf.pstatic.net/tvcast/20210729_254/1yGHw_1627519358274KVjsQ_JPEG/1627519305411.jpg" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="110"/>, <img alt="" class="imageLazyLoad" height="64" lazy-src="https://dthumb-phinf.pstatic.net/?twidth=110&theight=64&opts=12&qlt=95&src=https://phinf.pstatic.net/tvcast/20210729_182/JVMru_1627520450778FBRJX_JPEG/1627520214890.jpg" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="110"/>, <img alt="" class="imageLazyLoad" height="64" lazy-src="https://dthumb-phinf.pstatic.net/?twidth=110&theight=64&opts=12&qlt=95&src=https://phinf.pstatic.net/tvcast/20210729_90/6Eerx_162751570190219LmM_PNG/1627515658941.png" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="110"/>, <img alt="" class="imageLazyLoad" lazy-src="https://sports-phinf.pstatic.net/20180928_260/1538109748561DWCaR_PNG/07_%B1%E2%C5%B8_%C5%D7%B4%CF%BD%BA.png" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="300"/>, <img alt="Britain England Italy Euro 2020 Soccer" class="imageLazyLoad" height="200" lazy-src="https://dthumb-phinf.pstatic.net/?src=http://imgnews.naver.net/image/077/2021/07/12/PAP20210712220001055_P2_20210712073016012.jpg&type=nf200_200" onclick="clickcr(this, 'aec*g.photo', '', '', event);" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="200"/>, <img alt="Britain England Italy Euro 2020 Soccer" class="imageLazyLoad" height="100" lazy-src="https://dthumb-phinf.pstatic.net/?src=http://imgnews.naver.net/image/077/2021/07/12/PAP20210712219901055_P2_20210712073016713.jpg&type=nf100_100" onclick="clickcr(this, 'aec*g.photo', '', '', event);" 
src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="100"/>, <img alt="BRITAIN SOCCER UEFA EURO 2020" class="imageLazyLoad" height="100" lazy-src="https://dthumb-phinf.pstatic.net/?src=http://imgnews.naver.net/image/091/2021/07/12/PEP20210712140201055_P2_20210712073013086.jpg&type=nf100_100" onclick="clickcr(this, 'aec*g.photo', '', '', event);" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="100"/>, <img alt="Brazil Argentina Copa America Soccer" class="imageLazyLoad" height="200" lazy-src="https://dthumb-phinf.pstatic.net/?src=http://imgnews.naver.net/image/077/2021/07/11/PAP20210711121001055_P2_20210711101314502.jpg&type=nf200_200" onclick="clickcr(this, 'aec*g.photo', '', '', event);" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="200"/>, <img alt="BRAZIL SOCCER COPA AMERICA 2021" class="imageLazyLoad" height="100" lazy-src="https://dthumb-phinf.pstatic.net/?src=http://imgnews.naver.net/image/091/2021/07/11/PEP20210711085401055_P2_20210711100914844.jpg&type=nf100_100" onclick="clickcr(this, 'aec*g.photo', '', '', event);" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="100"/>, <img alt="BRAZIL SOCCER COPA AMERICA 2021" class="imageLazyLoad" height="100" lazy-src="https://dthumb-phinf.pstatic.net/?src=http://imgnews.naver.net/image/091/2021/07/11/PEP20210711085301055_P2_20210711100914158.jpg&type=nf100_100" onclick="clickcr(this, 'aec*g.photo', '', '', event);" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="100"/>, <img alt="Brazil Colombia Peru Copa America Soccer" class="imageLazyLoad" height="200" lazy-src="https://dthumb-phinf.pstatic.net/?src=http://imgnews.naver.net/image/077/2021/07/10/PAP20210710131701055_P2_20210710102214053.jpg&type=nf200_200" onclick="clickcr(this, 'aec*g.photo', '', '', event);" 
src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="200"/>, <img alt="Brazil Colombia Peru Copa America Soccer" class="imageLazyLoad" height="100" lazy-src="https://dthumb-phinf.pstatic.net/?src=http://imgnews.naver.net/image/077/2021/07/10/PAP20210710131401055_P2_20210710102016462.jpg&type=nf100_100" onclick="clickcr(this, 'aec*g.photo', '', '', event);" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="100"/>, <img alt="BRAZIL SOCCER COPA AMERICA 2021" class="imageLazyLoad" height="100" lazy-src="https://dthumb-phinf.pstatic.net/?src=http://imgnews.naver.net/image/091/2021/07/10/PEP20210710089401055_P2_20210710101118356.jpg&type=nf100_100" onclick="clickcr(this, 'aec*g.photo', '', '', event);" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="100"/>, <img alt="Britain England Denmark Euro 2020 Soccer" class="imageLazyLoad" height="200" lazy-src="https://dthumb-phinf.pstatic.net/?src=http://imgnews.naver.net/image/077/2021/07/08/PAP20210708186701055_P2_20210708071022169.jpg&type=nf200_200" onclick="clickcr(this, 'aec*g.photo', '', '', event);" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="200"/>, <img alt="Britain England Denmark Euro 2020 Soccer" class="imageLazyLoad" height="100" lazy-src="https://dthumb-phinf.pstatic.net/?src=http://imgnews.naver.net/image/077/2021/07/08/PAP20210708186001055_P2_20210708071014120.jpg&type=nf100_100" onclick="clickcr(this, 'aec*g.photo', '', '', event);" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="100"/>, <img alt="Britain England Denmark Euro 2020 Soccer" class="imageLazyLoad" height="100" lazy-src="https://dthumb-phinf.pstatic.net/?src=http://imgnews.naver.net/image/077/2021/07/08/PAP20210708185501055_P2_20210708071013397.jpg&type=nf100_100" onclick="clickcr(this, 'aec*g.photo', '', '', event);" 
src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="100"/>, <img alt="NAVER" height="11" src="https://ssl.pstatic.net/static/common/footer/ci_naver.gif" width="63"/>]
In [105]:
soup.find_all('img', attrs={'src': re.compile(r'.+\.jpg')})
Out[105]:
[<img alt="" src="https://imgnews.pstatic.net/image/108/2021/07/29/0002976247_001_20210729050508072.jpg?type=w647"/>, <img alt="" height="40" onerror="$(this).parent().hide();" src="https://mimgnews.pstatic.net/image/upload/journalist/2021/03/08/KMS.jpg" width="40"/>, <img alt="스타뉴스" class="press_img" height="20" onerror="$(this).parent().hide();" src=" https://mimgnews.pstatic.net/image/upload/office_logo/108/2017/01/11/logo_108_6_20170111151211.jpg " title=""/>, <img alt="" height="54" src="https://dthumb-phinf.pstatic.net/?type=sports_nf60_54&sharpen=true&src=https://imgnews.pstatic.net/image/thumb154/477/2021/07/29/311782.jpg" width="60"/>, <img alt="" height="54" src="https://dthumb-phinf.pstatic.net/?type=sports_nf60_54&sharpen=true&src=https://imgnews.pstatic.net/image/thumb154/139/2021/07/29/2154211.jpg" width="60"/>, <img alt="" height="54" src="https://dthumb-phinf.pstatic.net/?type=sports_nf60_54&sharpen=true&src=https://imgnews.pstatic.net/image/thumb154/216/2021/07/29/115463.jpg" width="60"/>, <img alt="" height="54" src="https://dthumb-phinf.pstatic.net/?type=sports_nf60_54&sharpen=true&src=https://imgnews.pstatic.net/image/thumb154/343/2021/07/29/106828.jpg" width="60"/>]
- Fetch png files
In [106]:
soup.find_all('img', attrs={'src': re.compile(r'.+\.png')})
Out[106]:
[]
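The result is empty because, in the `img` list shown earlier, the png URLs appear only in the `lazy-src` attribute of lazily loaded images, whose `src` is a placeholder data URI. A minimal sketch of searching that attribute instead follows. The sample HTML and the `example.com` URLs are made up, and the pattern is one possible choice.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup mimicking a normal image and a lazily loaded one.
html = """
<img src="https://example.com/a.jpg"/>
<img class="imageLazyLoad" lazy-src="https://example.com/b.png"
     src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="/>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# ".png" at the end of the URL or right before a query string.
png = re.compile(r"\.png(\?|$)")

in_src = soup_demo.find_all("img", attrs={"src": png})
in_lazy = soup_demo.find_all("img", attrs={"lazy-src": png})

print(len(in_src))                           # 0
print([img["lazy-src"] for img in in_lazy])  # ['https://example.com/b.png']
```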
- Find h3 tags whose class name ends with lind
In [113]:
soup.find_all('h3', class_=re.compile('.+lind$'))
Out[113]:
[<h3 class="blind">올림픽채널</h3>]