Big Data Study, Final Chapter - Mini Project - Crawling and Analyzing Netflix Drama Rankings
2021. 11. 29. 03:11ㆍBig Data Study
In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
In [2]:
# Request each page URL with requests.get()
headers = {'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"}
for i in range(1, 6):  # Only the top 200 are extracted (pages 1-5)
    url = f'https://flixpatrol.com/calendar/popular/tv-shows/netflix/right-now/{i}/'
    r = requests.get(url, headers=headers)
    print(r.status_code)
# Five 200 responses mean every request succeeded
200 200 200 200 200
In [3]:
headers = {'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"}
netflix_show = []
for i in range(1, 6):  # Only the top 200 are extracted (pages 1-5)
    url = f'https://flixpatrol.com/calendar/popular/tv-shows/netflix/right-now/{i}/'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    title_info = soup.find_all('tr', {'class': 'table-group'})
    for info in title_info:
        show_title = info.find_all('div', {'class': 'group-hover:underline'})  # Extract the title
        title = show_title[0].get_text().strip()
        # Keep only <span> tags that have neither a class nor a title attribute
        show_info = info.find_all('span', {'class': False, 'title': False})
        nation = show_info[0].text
        date = show_info[1].text
        genre = show_info[3].text
        netflix_show.append([title, nation, date, genre])
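The loop above reads the country, date, and genre by fixed position (0, 1, and 3), so a row with fewer spans would raise an `IndexError`. As a minimal sketch of a guard, the hypothetical helper `pick_fields` below (not part of the original notebook) takes the span texts of one row and returns `None` when the row is too short. It assumes the same positions as the loop; what sits at index 2 is not stated in the post, so the sample uses a placeholder there.

```python
from typing import Optional, Sequence, Tuple


def pick_fields(span_texts: Sequence[str]) -> Optional[Tuple[str, str, str]]:
    """Return (nation, date, genre) for one table row, or None if the row is too short."""
    # Positions 0, 1 and 3 mirror the indices used in the scraping loop.
    if len(span_texts) < 4:
        return None
    nation, date, genre = span_texts[0], span_texts[1], span_texts[3]
    return nation.strip(), date.strip(), genre.strip()


# Hypothetical span texts for one row; the value at index 2 is a placeholder.
print(pick_fields(["South Korea", "09/17/2021", "placeholder", "Mystery"]))
print(pick_fields(["South Korea"]))  # too short, so the row would be skipped
```

Inside the loop this could be called as `pick_fields([s.text for s in show_info])`, appending to `netflix_show` only when the result is not `None`.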
In [4]:
flix_df = pd.DataFrame(netflix_show,
                       columns=['title', 'nation', 'date', 'genre'])
flix_df
Out[4]:
index | title | nation | date | genre |
---|---|---|---|---|
0 | Lucifer | United States | 01/25/2016 | Superhero |
1 | The Queen's Gambit | United States | 10/23/2020 | Drama |
2 | Squid Game | South Korea | 09/17/2021 | Mystery |
3 | Money Heist | Spain | 05/02/2017 | Action |
4 | Lupin | United States | 01/08/2021 | Crime |
... | ... | ... | ... | ... |
195 | Invisible City | Canada | 02/05/2021 | Mystery |
196 | Do Do Sol Sol La La Sol | South Korea | 10/07/2020 | Comedy |
197 | Dare Me | United States | 12/29/2019 | Crime |
198 | Resident Evil: Infinite Darkness | United States | 07/08/2021 | Animation |
199 | This is a Robbery: The World's Biggest Art Heist | United States | 04/07/2021 | Documentary |
200 rows × 4 columns
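The `date` column is stored as plain text, so it cannot be sorted or filtered chronologically as it stands. A minimal sketch of converting it, assuming the month/day/year layout seen in the output above (for example `10/23/2020`); the `sample` frame holds three rows copied from that output, and the `year` column is an illustrative addition:

```python
import pandas as pd

# Three rows copied from the output above.
sample = pd.DataFrame(
    [["Lucifer", "United States", "01/25/2016", "Superhero"],
     ["Squid Game", "South Korea", "09/17/2021", "Mystery"],
     ["Money Heist", "Spain", "05/02/2017", "Action"]],
    columns=["title", "nation", "date", "genre"],
)

# An explicit format avoids guessing between month/day and day/month.
sample["date"] = pd.to_datetime(sample["date"], format="%m/%d/%Y")
sample["year"] = sample["date"].dt.year

# With real datetimes, sorting is chronological rather than alphabetical.
newest_first = sample.sort_values("date", ascending=False)
print(newest_first[["title", "year"]])
```

The same two lines would apply to `flix_df` directly, provided every scraped date follows this layout.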
In [5]:
# Unpack the shape to check the size of the freshly collected data
(rows, columns) = flix_df.shape
print(f'rows : {rows}')
print(f'columns : {columns}')
# The DataFrame was built successfully with all 200 ranks
rows : 200
columns : 4
In [6]:
flix_df.head()  # First 5 rows
Out[6]:
index | title | nation | date | genre |
---|---|---|---|---|
0 | Lucifer | United States | 01/25/2016 | Superhero |
1 | The Queen's Gambit | United States | 10/23/2020 | Drama |
2 | Squid Game | South Korea | 09/17/2021 | Mystery |
3 | Money Heist | Spain | 05/02/2017 | Action |
4 | Lupin | United States | 01/08/2021 | Crime |
In [7]:
flix_df.tail()  # Last 5 rows
Out[7]:
index | title | nation | date | genre |
---|---|---|---|---|
195 | Invisible City | Canada | 02/05/2021 | Mystery |
196 | Do Do Sol Sol La La Sol | South Korea | 10/07/2020 | Comedy |
197 | Dare Me | United States | 12/29/2019 | Crime |
198 | Resident Evil: Infinite Darkness | United States | 07/08/2021 | Animation |
199 | This is a Robbery: The World's Biggest Art Heist | United States | 04/07/2021 | Documentary |
In [8]:
flix_df[flix_df['nation'] == 'South Korea']  # Select only South Korean TV shows
Out[8]:
index | title | nation | date | genre |
---|---|---|---|---|
2 | Squid Game | South Korea | 09/17/2021 | Mystery |
20 | Vincenzo | South Korea | 02/20/2021 | Crime |
37 | Hometown Cha-Cha-Cha | South Korea | 08/28/2021 | Comedy |
41 | It's Okay to Not Be Okay | South Korea | 06/20/2020 | Drama |
45 | Hospital Playlist | South Korea | 03/12/2020 | Drama |
53 | My Name | South Korea | 10/15/2021 | Mystery |
63 | Nevertheless | South Korea | 06/19/2021 | Drama |
64 | Crash Landing on You | South Korea | 12/14/2019 | Drama |
65 | The King: Eternal Monarch | South Korea | 04/17/2020 | Science Fiction |
70 | Hellbound | South Korea | 11/19/2021 | Crime |
84 | Sisyphus: The Myth | South Korea | 02/17/2021 | Science Fiction |
87 | Sweet Home | South Korea | 12/18/2020 | Horror |
95 | The King's Affection | South Korea | 10/11/2021 | Drama |
112 | Was It Love | South Korea | 07/08/2020 | Comedy |
116 | Law School | South Korea | 04/14/2021 | Crime |
130 | Mystic Pop-Up Bar | South Korea | 05/20/2020 | Mystery |
140 | Love Marriage Divorce | South Korea | 01/23/2021 | Drama |
141 | Itaewon Class | South Korea | 01/31/2020 | Drama |
161 | Private Lives | South Korea | 10/07/2020 | Crime |
169 | Love Alarm | South Korea | 08/22/2019 | Drama |
194 | Move to Heaven | South Korea | 05/14/2021 | Drama |
196 | Do Do Sol Sol La La Sol | South Korea | 10/07/2020 | Comedy |
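
The filter above is a boolean mask, and masks can be combined with `&` (and) or `|` (or) to narrow the selection further. A minimal sketch, using five rows copied from the outputs above; the variable names are illustrative:

```python
import pandas as pd

# Five rows copied from the outputs above.
sample = pd.DataFrame(
    [["Squid Game", "South Korea", "09/17/2021", "Mystery"],
     ["Vincenzo", "South Korea", "02/20/2021", "Crime"],
     ["Hospital Playlist", "South Korea", "03/12/2020", "Drama"],
     ["Money Heist", "Spain", "05/02/2017", "Action"],
     ["The Queen's Gambit", "United States", "10/23/2020", "Drama"]],
    columns=["title", "nation", "date", "genre"],
)

is_korean = sample["nation"] == "South Korea"
is_drama = sample["genre"] == "Drama"

# Both conditions must hold; each mask is a Series of True/False values.
korean_dramas = sample[is_korean & is_drama]
print(korean_dramas["title"].tolist())
```

When the comparisons are written inline instead of stored in variables, each one needs its own parentheses, because `&` binds more tightly than `==`.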
In [9]:
flix_df['nation'].mode()  # Most frequent country in the top 200
Out[9]:
0    United States
dtype: object
In [10]:
# Group the data for analysis
n = flix_df.groupby('nation')
n.size()
Out[10]:
nation
Australia          1
Austria            1
Belgium            2
Brazil             1
Canada             5
Colombia           4
Denmark            2
Egypt              1
France             3
Germany            6
Iceland            1
Italy              3
Japan              4
Jordan             1
Luxembourg         1
Mexico             8
Norway             2
Poland             2
Russia             1
South Africa       1
South Korea       22
Spain             11
Sweden             3
Thailand           1
Turkey             4
United Kingdom    13
United States     96
dtype: int64
In [11]:
# Sort countries by number of shows, largest first
n = flix_df.groupby('nation')
n.size().sort_values(ascending=False)
Out[11]:
nation
United States     96
South Korea       22
United Kingdom    13
Spain             11
Mexico             8
Germany            6
Canada             5
Colombia           4
Turkey             4
Japan              4
France             3
Italy              3
Sweden             3
Poland             2
Norway             2
Belgium            2
Denmark            2
South Africa       1
Thailand           1
Australia          1
Russia             1
Luxembourg         1
Austria            1
Iceland            1
Egypt              1
Brazil             1
Jordan             1
dtype: int64
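Raw counts are easier to compare as shares of the 200 rows. A minimal sketch of the arithmetic in plain Python, using the five largest counts copied from the output above; `top_counts` and `shares` are illustrative names:

```python
# Counts copied from the sorted output above (five largest countries only).
top_counts = {
    "United States": 96,
    "South Korea": 22,
    "United Kingdom": 13,
    "Spain": 11,
    "Mexico": 8,
}
total_rows = 200  # number of rows in flix_df

# Share of the top 200 held by each country, as a percentage.
shares = {
    nation: round(count / total_rows * 100, 1)
    for nation, count in top_counts.items()
}

for nation, pct in shares.items():
    print(f"{nation}: {pct}%")
```

On the full DataFrame, `flix_df['nation'].value_counts(normalize=True)` should give the same proportions as fractions, already sorted from largest to smallest.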
In [12]:
flix_df.to_csv('netflix_rank_nov_29th.csv', index=False, encoding='UTF-8')  # Write the CSV file
In [13]:
pd.read_csv('netflix_rank_nov_29th.csv')  # Load the saved file
Out[13]:
index | title | nation | date | genre |
---|---|---|---|---|
0 | Lucifer | United States | 01/25/2016 | Superhero |
1 | The Queen's Gambit | United States | 10/23/2020 | Drama |
2 | Squid Game | South Korea | 09/17/2021 | Mystery |
3 | Money Heist | Spain | 05/02/2017 | Action |
4 | Lupin | United States | 01/08/2021 | Crime |
... | ... | ... | ... | ... |
195 | Invisible City | Canada | 02/05/2021 | Mystery |
196 | Do Do Sol Sol La La Sol | South Korea | 10/07/2020 | Comedy |
197 | Dare Me | United States | 12/29/2019 | Crime |
198 | Resident Evil: Infinite Darkness | United States | 07/08/2021 | Animation |
199 | This is a Robbery: The World's Biggest Art Heist | United States | 04/07/2021 | Documentary |
200 rows × 4 columns
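The reloaded table has the same four columns because the file was written with `index=False`. A minimal sketch of what that argument changes, using an in-memory buffer (`io.StringIO`) in place of a file on disk and a one-row frame copied from the data above:

```python
import io

import pandas as pd

sample = pd.DataFrame([["Squid Game", "South Korea"]], columns=["title", "nation"])

# Default behaviour: the row index is written as an extra, unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False writes only the data columns, so the round trip is clean.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```

Without `index=False`, every save and reload cycle would add another `Unnamed: 0` column.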