Learning to Log In for Web Crawling
In [1]:
import requests
import json
from bs4 import BeautifulSoup
Crawling the comment count of a Daum news article
- In the browser's developer tools, find the XHR request (and its headers) that loads the comment count
- Convert the response to JSON and read the value that holds the comment count
In [2]:
url = 'https://comment.daum.net/apis/v1/ui/single/main/@20210729173132818'
headers = {
    'Authorization': 'Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJmb3J1bV9rZXkiOiJuZXdzIiwiZ3JhbnRfdHlwZSI6ImFsZXhfY3JlZGVudGlhbHMiLCJzY29wZSI6W10sImV4cCI6MTYyNzU5ODE5OCwiYXV0aG9yaXRpZXMiOlsiUk9MRV9DTElFTlQiXSwianRpIjoiMTVlMjE2ZTgtYzRkNS00MjhhLTk0MDktY2Q1ZTU2OTU0NDBlIiwiZm9ydW1faWQiOi05OSwiY2xpZW50X2lkIjoiMjZCWEF2S255NVdGNVowOWxyNWs3N1k4In0.9PKdRWr61j0GwNsVTWOzuGaw7puxrCeMsXYqeVxzxUg',
    'Origin': 'https://news.v.daum.net',
    'Referer': 'https://news.v.daum.net/v/20210729173132818',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
}
resp = requests.get(url, headers=headers)
print(resp)
<Response [200]>
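Note that the Bearer token copied from the developer tools is a JWT, and its payload carries an exp (expiry) claim, so a pasted token stops working after a while. A minimal sketch to inspect the expiry (plain base64 decoding, no signature verification):

import base64, json, time

def jwt_exp(token):
    """Decode a JWT payload (the middle segment) without verifying the signature."""
    payload_b64 = token.split('.')[1]
    payload_b64 += '=' * (-len(payload_b64) % 4)   # restore base64url padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))['exp']

exp = jwt_exp(headers['Authorization'].split()[1])
print(exp, '-> expired' if exp < time.time() else '-> still valid')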
Parsing the JSON response into a dict
In [3]:
resp.json()
Out[3]:
{'post': {'id': 158972971,
  'forumId': -99,
  'postKey': '20210729173132818',
  'flags': 0,
  'title': "LG화학, LG전자 분리막사업 인수.. '세계 유일' 배터리 4대 핵심 소재 기술 확보",
  'url': 'https://news.v.daum.net/v/k5CU1DehIh',
  'icon': 'https://img1.daumcdn.net/thumb/S1200x630/?fname=https://t1.daumcdn.net/news/202107/29/donga/20210729173134301lhri.jpg',
  'commentCount': 98,
  'childCount': 18,
  'popularOpened': True}}
Extracting the value by key
In [4]:
resp.json()['post']['commentCount']
Out[4]:
98
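Chained indexing raises a KeyError if the response shape ever changes; a slightly defensive variant uses dict.get:

# Defensive lookup: falls back to 0 instead of raising KeyError
resp.json().get('post', {}).get('commentCount', 0)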
Crawling data after logging in
- Find the endpoint
- After logging in, the login endpoint can be found in the developer tools (Network tab)
- Log in to DCInside and check the remaining mandu (the site's point currency)
In [5]:
url = 'https://dcid.dcinside.com/join/member_check.php'
- Write the login form data
- The ID and password are blanked out here ^^
In [6]:
data = {
    's_url': '//www.dcinside.com/',
    'ssl': 'Y',
    'juOB2z6U5A27W9wN': 'V0EY4I6b00vJiC8R',
    'user_id': '',
    'pw': ''
}
headers = {
    'Referer': 'https://www.dcinside.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
}
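The oddly named juOB2z6U5A27W9wN field looks like a per-session token generated by the login form, so a hard-coded value may stop working. A hedged sketch, assuming the hidden inputs sit in a form on the main page (the selector is a guess):

login_page = requests.get('https://www.dcinside.com/', headers=headers)
form_soup = BeautifulSoup(login_page.text, 'html.parser')
# collect every hidden <input> in the form and merge it into the POST data
hidden = {tag['name']: tag.get('value', '')
          for tag in form_soup.select('form input[type=hidden]') if tag.get('name')}
data.update(hidden)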
In [7]:
response = requests.post(url, headers=headers, data=data)
print(response.text)
<!DOCTYPE html>
<html lang="ko">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
<meta name="content-language" content="kr">
<meta name="google-site-verification" content="8_SyZg2Wg3LNnCmFXzETp7ld4yjZB8ny17m8QsYsLwk">
<meta name="author" content="디시인사이드">
<meta name="title" content="We are with you all the way! IT is Life! 디시인사이드 입니다.">
<meta name="description" content="디시인사이드 로그인">
<meta property="og:type" content="website">
<meta property="og:title" content="디시인사이드">
<meta property="og:description" content="디시인사이드 로그인">
<meta property="og:image" content="http://nstatic.dcinside.com/dc/w/images/descrip_img.png">
<meta property="og:url" content="http://www.dcinside.com/">
<title>디시인사이드</title>
<link rel="shortcut icon" href="//nstatic.dcinside.com/dc/w/images/logo_icon.ico"/>
<link rel="stylesheet" type="text/css" href="//nstatic.dcinside.com/dc/w/css/reset.css"/>
<link rel="stylesheet" type="text/css" href="//nstatic.dcinside.com/dc/w/css/login.css"/>
<link rel="stylesheet" type="text/css" href="//nstatic.dcinside.com/dc/w/css/common.css"/>
<link rel="stylesheet" type="text/css" href="//nstatic.dcinside.com/dc/w/css/popup.css"/>
<script type="text/javascript" src="https://nstatic.dcinside.com/dc/w/js/html5shiv.min.js"></script>
<!--[if IE 7]>
<link rel="stylesheet" type="text/css" href="http://nstatic.dcinside.com/dc/w/css/ie7.css"/>
<![endif]-->
<script type="text/javascript" src="js/jquery.js"></script>
<script type="text/javascript" src="js/member_.js?201807191401"></script>
<script type="text/javascript">
<!--
//document.domain = "dcinside.com";
//-->
function CheckMandu() {
var myMandu = 0;
if(myMandu > 0) {
alert("남은 만두가 "+myMandu+"개 있습니다. \n탈퇴시 만두는 복구되지 않습니다.");
}
if('' == 1) {
alert('만두선물 혹은 확인 내역이 있습니다. \n탈퇴시 선물을 받으실 수 없습니다.');
}
}
</script>
</head>
</head><body>
<html><head><meta http-equiv="refresh" content="0; url=//www.dcinside.com/"></head><body></body></html>
In [8]:
# requests.post alone throws the login cookies away once the call returns,
# so use a Session, which carries cookies across requests
session = requests.Session()
session.post(url, headers=headers, data=data)
response = session.get("https://www.dcinside.com/")
In [9]:
soup = BeautifulSoup(response.text, 'html.parser')
soup.select('span.num')  # comment counts per post; the last element is the mandu balance
Out[9]:
[<span class="num">[96]</span>,
<span class="num">[98]</span>,
<span class="num">[229]</span>,
<span class="num">[90]</span>,
<span class="num">[408]</span>,
<span class="num">[85]</span>,
<span class="num">[336]</span>,
<span class="num">[213]</span>,
<span class="num">[850]</span>,
<span class="num">[339]</span>,
<span class="num">[195]</span>,
<span class="num">[309]</span>,
<span class="num">[752]</span>,
<span class="num">[162]</span>,
<span class="num">[378]</span>,
<span class="num">[486]</span>,
<span class="num">[815]</span>,
<span class="num">[715]</span>,
<span class="num">[493]</span>,
<span class="num">[167]</span>,
<span class="num">[343]</span>,
<span class="num">[186]</span>,
<span class="num">[88]</span>,
<span class="num">[651]</span>,
<span class="num">[302]</span>,
<span class="num">[188]</span>,
<span class="num">[1352]</span>,
<span class="num">[455]</span>,
<span class="num">[380]</span>,
<span class="num">[637]</span>,
<span class="num">[353]</span>,
<span class="num">[367]</span>,
<span class="num">[410]</span>,
<span class="num">[501]</span>,
<span class="num">[487]</span>,
<span class="num">[267]</span>,
<span class="num">[511]</span>,
<span class="num">[588]</span>,
<span class="num">[99]</span>,
<span class="num">[637]</span>,
<span class="num">[496]</span>,
<span class="num">[367]</span>,
<span class="num">[564]</span>,
<span class="num">[242]</span>,
<span class="num">[763]</span>,
<span class="num">[826]</span>,
<span class="num">[342]</span>,
<span class="num">[400]</span>,
<span class="num">[591]</span>,
<span class="num">[167]</span>,
<span class="num">[771]</span>,
<span class="num">[829]</span>,
<span class="num">[587]</span>,
<span class="num">[163]</span>,
<span class="num">[574]</span>,
<span class="num">[236]</span>,
<span class="num">[406]</span>,
<span class="num">[561]</span>,
<span class="num">[486]</span>,
<span class="num">[227]</span>,
<span class="num">[229]</span>,
<span class="num">[436]</span>,
<span class="num">[1250]</span>,
<span class="num">[1202]</span>,
<span class="num">[275]</span>,
<span class="num">[580]</span>,
<span class="num">[310]</span>,
<span class="num">[645]</span>,
<span class="num">[1025]</span>,
<span class="num">[427]</span>,
<span class="num">[771]</span>,
<span class="num">[583]</span>,
<span class="num">[305]</span>,
<span class="num">[367]</span>,
<span class="num">[575]</span>,
<span class="num">0개</span>]
In [10]:
soup.select('span.num')[-1]
Out[10]:
<span class="num">0개</span>
selenium
- A web page test automation module (used here to drive a real browser for crawling)
In [6]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time
- Automating a search on python.org
In [16]:
chrome_driver = '../chromedriver.exe'
driver = webdriver.Chrome(chrome_driver)
driver.get('https://www.python.org')

# type 'lambda' into the search box and submit it with Enter
search = driver.find_element_by_id('id-search-field')
search.clear()
time.sleep(1)
search.send_keys('lambda')
time.sleep(1)
search.send_keys(Keys.RETURN)
time.sleep(1)
driver.close()
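Note that in Selenium 4 the positional driver path and the find_element_by_* helpers were removed. A hedged equivalent for newer versions (driver path reused from above):

from selenium.webdriver.chrome.service import Service

service = Service('../chromedriver.exe')      # same driver path as above
driver = webdriver.Chrome(service=service)
driver.get('https://www.python.org')
search = driver.find_element(By.ID, 'id-search-field')   # By was imported earlier
search.clear()
search.send_keys('lambda', Keys.RETURN)
driver.quit()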
Crawling Daum news with selenium
- Getting the comment count
In [22]:
chrome_driver = '../chromedriver.exe'
driver = webdriver.Chrome(chrome_driver)
driver.get('https://news.v.daum.net/v/20210730101735040')
src = driver.page_source   # HTML after the page's JavaScript has run
soup = BeautifulSoup(src, 'html.parser')
driver.close()

comment = soup.select_one('span.alex-count-area')   # comment counter element
comment.get_text()
Out[22]:
'639'
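driver.page_source is read as soon as get() returns, so on a slow connection the comment widget may not have rendered yet. A hedged variant that waits for the counter first, using the same technique as the Naver example below:

driver = webdriver.Chrome(chrome_driver)
driver.get('https://news.v.daum.net/v/20210730101735040')
# wait (up to 10 seconds) for the comment counter to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'span.alex-count-area')))
src = driver.page_source
driver.close()
BeautifulSoup(src, 'html.parser').select_one('span.alex-count-area').get_text()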
Crawling the Naver News comment count
- Wait until the element has loaded, then crawl
In [25]:
chrome_driver = '../chromedriver.exe'
driver = webdriver.Chrome(chrome_driver)
driver.get('https://finance.naver.com/item/news_read.nhn?article_id=0004678667&office_id=014&code=263750&sm=title_entity_id.basic')

# wait (up to 10 seconds) until the element matching the selector has loaded
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'span.u_cbox_count')))

src = driver.page_source
soup = BeautifulSoup(src, 'html.parser')
driver.close()

comment = soup.select_one('span.u_cbox_count')
comment.get_text()
Out[25]:
'10'
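For unattended runs, the browser window can be suppressed entirely. A minimal headless sketch, assuming the same Selenium 3-style constructor as the cells above:

from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')     # render pages without a visible browser window
driver = webdriver.Chrome(chrome_driver, options=options)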