Web Crawling and Data Analysis: Analyzing Football Player Market Values Worldwide - 2 (Web Crawling Practice)

2021. 11. 8. 23:47 · Big Data Study

crawling

Crawling Basics Practice

Requests Practice

In [5]:

# Import requests
import requests

In [6]:

# Put a 'User-Agent' value in headers so the request looks like it comes from a browser
headers = {'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"}

# Put the target address in url
url = "https://www.transfermarkt.com/"

# Send the request with requests.get()
r = requests.get(url, headers=headers)
r.status_code
# A status code of 200 means the request succeeded

Out[6]:
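
Checking `r.status_code` by eye works in a notebook, but in a script it helps to turn a bad status into an explicit message. The following is a minimal sketch, not part of the original notebook: `describe_status` is a hypothetical helper, and the `Response` objects are built by hand (an assumption made only so the example runs without network access).

```python
import requests


def describe_status(response):
    """Return a short message describing whether a requests.Response succeeded."""
    try:
        # raise_for_status() raises HTTPError for 4xx/5xx codes and does nothing otherwise.
        response.raise_for_status()
    except requests.exceptions.HTTPError as err:
        return f"request failed: {err}"
    return f"request succeeded: {response.status_code}"


# Hand-built responses stand in for real ones so no request is actually sent.
ok = requests.Response()
ok.status_code = 200

blocked = requests.Response()
blocked.status_code = 403

print(describe_status(ok))
print(describe_status(blocked))
```

With a real request, the same helper could be applied to the result of `requests.get(url, headers=headers, timeout=10)`; the `timeout` value here is only an illustrative choice.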

BeautifulSoup Quick Start Practice

In [7]:

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>"""

In [8]:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')

In [11]:

# Get the <p> tag (only the first one that appears)
soup.p

Out[11]:

<p class="title"><b>The Dormouse's story</b></p>

In [12]:

# Second way: find() returns the same first match
soup.find('p')

Out[12]:

<p class="title"><b>The Dormouse's story</b></p>
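
One detail worth keeping in mind with `find()`: when nothing matches, it returns `None` rather than raising an error. The sketch below uses a shortened copy of the sample HTML (an assumption for brevity) to show both cases.

```python
from bs4 import BeautifulSoup

# Shortened sample document, used here only to keep the example small.
html = '<p class="title"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")      # a Tag: the first <p> in the document
missing = soup.find("table")  # None: the snippet contains no <table>

if missing is None:
    print("no <table> found")
print(first_p.name)
```

Checking for `None` before chaining `.text` or `['href']` avoids an `AttributeError` or `TypeError` on pages where the element is absent.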

In [14]:

# Get the 'href' attribute of the <a> tag (only the first one that appears)
soup.find('a')['href']

Out[14]:

'http://example.com/elsie'

In [15]:

# Second way: attribute-style access to the first <a> tag
soup.a['href']

Out[15]:

'http://example.com/elsie'
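
Square-bracket access assumes the attribute exists. As a hedged sketch of the alternative, `Tag.get()` returns `None` (or a default you pass in) when the attribute is missing; the second `<a>` tag without an `href` is made up for this illustration.

```python
from bs4 import BeautifulSoup

# The second link has no href; it is a made-up case for this illustration.
html = '<a href="http://example.com/elsie" id="link1">Elsie</a><a id="nolink">Nobody</a>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

print(first["href"])           # bracket access works when the attribute exists
print(second.get("href"))      # get() returns None when it does not
print(second.get("href", ""))  # or a default value of your choice

try:
    second["href"]
except KeyError:
    print("bracket access raises KeyError for a missing attribute")
```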

In [17]:

# Get the text inside the tag (only the first one that appears)
soup.find('a').text

Out[17]:

'Elsie'

In [18]:

# Second way: get_text() returns the same string as .text
soup.find('a').get_text()

Out[18]:

'Elsie'
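
`.text` and `get_text()` return the same string when called without arguments, but `get_text()` also accepts a separator and a `strip` flag, which is handy when a tag contains nested tags and stray whitespace. A small sketch, using a shortened paragraph that is an assumption for illustration:

```python
from bs4 import BeautifulSoup

# A paragraph with nested links and extra whitespace, shortened for illustration.
html = '<p class="story">  Once upon a time <a href="#">Elsie</a> and <a href="#">Lacie</a>  </p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

raw = p.get_text()                   # identical to p.text, whitespace included
clean = p.get_text(" ", strip=True)  # strip each piece, then join with one space

print(repr(raw))
print(repr(clean))
```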

In [19]:

# Get every <a> tag in the document
soup.find_all('a')

Out[19]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [20]:

# Get the second <a> tag (indexing starts at 0, so use index 1)
soup.find_all('a')[1]

Out[20]:

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

In [22]:

# Get the 'href' attribute of every <a> tag (find_all returns a list, so iterate over it)
a_list = soup.find_all('a')
for i in a_list:
    print(i['href'])

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

In [23]:

# Get the text of every <a> tag
a_list = soup.find_all('a')
for i in a_list:
    print(i.text)

Elsie
Lacie
Tillie
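
The two loops above print values; for later analysis it is usually more convenient to keep them. As a minimal sketch (the `links` name and the dictionary keys are assumptions, not part of the original notebook), the same iteration can build a list of records in one pass:

```python
from bs4 import BeautifulSoup

# Only the three links from the sample document are reproduced here.
html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# One dictionary per <a> tag, holding both the link text and the href.
links = [{"name": a.text, "href": a["href"]} for a in soup.find_all("a")]

for link in links:
    print(link["name"], link["href"])
```

A list of dictionaries like this can be passed directly to `pandas.DataFrame` if the data is going to be analyzed in a table.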

In [24]:

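# Search by tag name and attribute value together
# Find <a> tags whose class is sister
# (the keyword is class_ with a trailing underscore because class is a reserved word in Python)

soup.find_all('a', class_='sister')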

Out[24]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [25]:

# Find all <a> tags whose id is link3
soup.find_all('a',id = 'link3')

Out[25]:

[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [26]:

# Second way: pass the attributes as a dictionary
soup.find_all('a',{'id':'link3'})

Out[26]:

[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [27]:

# In the dictionary form, the key is 'class' with no trailing underscore
soup.find_all('a',{'class':'sister'})

Out[27]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
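
BeautifulSoup also accepts CSS selectors through `select()` and `select_one()`, which some people find easier to read than keyword or dictionary filters. This is a sketch of an alternative that the original notebook does not use, shown on the same three links:

```python
from bs4 import BeautifulSoup

# Only the three links from the sample document are reproduced here.
html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

sisters = soup.select("a.sister")    # like find_all('a', class_='sister')
tillie = soup.select_one("a#link3")  # like find('a', id='link3')

print(len(sisters))
print(tillie.text)
```

`select()` always returns a list, while `select_one()` returns the first match or `None`, mirroring `find_all()` and `find()`.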
