Pandas Data Analysis Basics: Practice 6
2021. 9. 22. 17:20 · Big Data Study
In [1]:
import pandas as pd
In [2]:
student_list = [
{'name':'John','major':'computer science','sex':'male'},
{'name':'Nate','major':'computer science','sex':'male'},
{'name':'Abraham','major':'Physics','sex':'male'},
{'name':'Brian','major':'Psychology','sex':'male'},
{'name':'Janny','major':'Economics','sex':'female'},
{'name':'Yuna','major':'Economics','sex':'female'},
{'name':'Jeniffer','major':'computer science','sex':'female'},
{'name':'Edward','major':'computer science','sex':'male'},
{'name':'Zara','major':'Psycology','sex':'female'},
{'name':'Wendy','major':'Economics','sex':'female'},
{'name':'Sera','major':'Psycology','sex':'female'},
]
df = pd.DataFrame(student_list, columns=['name','major','sex'])
df
Out[2]:
| | name | major | sex |
---|---|---|---|
0 | John | computer science | male |
1 | Nate | computer science | male |
2 | Abraham | Physics | male |
3 | Brian | Psychology | male |
4 | Janny | Economics | female |
5 | Yuna | Economics | female |
6 | Jeniffer | computer science | female |
7 | Edward | computer science | male |
8 | Zara | Psycology | female |
9 | Wendy | Economics | female |
10 | Sera | Psycology | female |
In [21]:
# Count how many students are in each major
groupby_major = df.groupby('major')
groupby_major.groups
Out[21]:
{'Economics': [4, 5, 9], 'Physics': [2], 'Psychology': [3], 'Psycology': [8, 10], 'computer science': [0, 1, 6, 7]}
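The `groups` attribute maps each key to the row labels that belong to it. As a minimal sketch of how that mapping can be used, the snippet below pulls out the rows for one key with `get_group` and derives the group sizes from the mapping. The `sample` frame is illustrative stand-in data, not the full student list above.

```python
import pandas as pd

# Illustrative stand-in data (not the full student list above)
sample = pd.DataFrame({
    'name': ['John', 'Janny', 'Yuna', 'Abraham'],
    'major': ['computer science', 'Economics', 'Economics', 'Physics'],
})

grouped = sample.groupby('major')

# get_group returns the rows of the original frame that share one key
economics_rows = grouped.get_group('Economics')
print(economics_rows)

# groups is a dict-like mapping of key -> row labels, so its lengths are the group sizes
group_sizes = {major: len(labels) for major, labels in grouped.groups.items()}
print(group_sizes)
```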
In [19]:
for name, group in groupby_major:
    print(name + ' : ' + str(len(group)))
    print(group)
    print()
Economics : 3
    name      major     sex
4  Janny  Economics  female
5   Yuna  Economics  female
9  Wendy  Economics  female

Physics : 1
      name    major   sex
2  Abraham  Physics  male

Psychology : 1
    name       major   sex
3  Brian  Psychology  male

Psycology : 2
    name      major     sex
8   Zara  Psycology  female
10  Sera  Psycology  female

computer science : 4
       name             major     sex
0      John  computer science    male
1      Nate  computer science    male
6  Jeniffer  computer science  female
7    Edward  computer science    male
In [24]:
# reset_index() turns the major index into a regular column
df_major_cnt = pd.DataFrame({'count': groupby_major.size()}).reset_index()
df_major_cnt
Out[24]:
| | major | count |
---|---|---|
0 | Economics | 3 |
1 | Physics | 1 |
2 | Psychology | 1 |
3 | Psycology | 2 |
4 | computer science | 4 |
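The same kind of count table can also be built without wrapping the result in a dict first. This is a minimal sketch on illustrative stand-in data (the `sample` frame is an assumption, not the full student list above); `reset_index(name=...)` names the count column in one step.

```python
import pandas as pd

# Illustrative stand-in data (not the full student list above)
sample = pd.DataFrame({
    'name': ['John', 'Nate', 'Janny', 'Yuna', 'Abraham'],
    'major': ['computer science', 'computer science', 'Economics', 'Economics', 'Physics'],
})

# size() returns a Series indexed by major;
# reset_index(name='count') turns it into a DataFrame with columns major and count
major_counts = sample.groupby('major').size().reset_index(name='count')
print(major_counts)
```

Because grouping is done on the exact string, labels that differ only by spelling or capitalization end up in separate rows of the count table.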
In [25]:
groupby_sex = df.groupby('sex')
In [28]:
# Split the rows into groups by sex
for name, group in groupby_sex:
    print(name + ' : ' + str(len(group)))
    print(group)
    print()
female : 6
        name             major     sex
4      Janny         Economics  female
5       Yuna         Economics  female
6   Jeniffer  computer science  female
8       Zara         Psycology  female
9      Wendy         Economics  female
10      Sera         Psycology  female

male : 5
      name             major   sex
0     John  computer science  male
1     Nate  computer science  male
2  Abraham           Physics  male
3    Brian        Psychology  male
7   Edward  computer science  male
In [29]:
student_list =[
{
'name':'John',
'major':'computer science',
'sex':'male'
},
{
'name':'Nate',
'major':'computer science',
'sex':'male'
},
{
'name':'Edward',
'major':'computer science',
'sex':'male'
},
{
'name':'Zara',
'major':'psychology',
'sex':'female'
},
{
'name':'John',
'major':'computer science',
'sex':'male'
}
]
df = pd.DataFrame(student_list,columns= ['name','major','sex'])
df
Out[29]:
| | name | major | sex |
---|---|---|---|
0 | John | computer science | male |
1 | Nate | computer science | male |
2 | Edward | computer science | male |
3 | Zara | psychology | female |
4 | John | computer science | male |
In [31]:
df.duplicated()  # check whether each row duplicates an earlier row
# Row 4 (John) is a duplicate of row 0
Out[31]:
0    False
1    False
2    False
3    False
4     True
dtype: bool
In [32]:
df.drop_duplicates()  # remove duplicate rows
Out[32]:
| | name | major | sex |
---|---|---|---|
0 | John | computer science | male |
1 | Nate | computer science | male |
2 | Edward | computer science | male |
3 | Zara | psychology | female |
In [34]:
student_list = [
{'name':'John',
'major':'computer science',
'sex':'male'},
{'name':'Nate',
'major':'computer science',
'sex':'male'},
{'name':'Edward',
'major':'computer science',
'sex':'male'},
{'name':'Zara',
'major':'Psycology',
'sex':'female'
},
{'name':'Wendy',
'major':'economics',
'sex':'female'},
{'name':'Nate',
'major':None,
'sex':'male'},
{'name':'John',
'major':'economics',
'sex':'male'}
]
df = pd.DataFrame(student_list,columns = ['name','major','sex'])
df
Out[34]:
| | name | major | sex |
---|---|---|---|
0 | John | computer science | male |
1 | Nate | computer science | male |
2 | Edward | computer science | male |
3 | Zara | Psycology | female |
4 | Wendy | economics | female |
5 | Nate | None | male |
6 | John | economics | male |
In [36]:
df.duplicated()  # no two rows match exactly, so every value is False
Out[36]:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
dtype: bool
In [39]:
df.duplicated(['name'])  # treat rows with the same name as duplicates
Out[39]:
0    False
1    False
2    False
3    False
4    False
5     True
6     True
dtype: bool
In [40]:
df.drop_duplicates(['name'], keep='first')  # keep only the first occurrence (default)
Out[40]:
| | name | major | sex |
---|---|---|---|
0 | John | computer science | male |
1 | Nate | computer science | male |
2 | Edward | computer science | male |
3 | Zara | Psycology | female |
4 | Wendy | economics | female |
In [41]:
df.drop_duplicates(['name'], keep='last')  # keep only the last occurrence
Out[41]:
| | name | major | sex |
---|---|---|---|
2 | Edward | computer science | male |
3 | Zara | Psycology | female |
4 | Wendy | economics | female |
5 | Nate | None | male |
6 | John | economics | male |
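Besides `'first'` and `'last'`, the `keep` parameter also accepts `False`, which treats every member of a duplicated group as a duplicate. A minimal sketch on illustrative stand-in data (the `people` frame is an assumption, modeled on the table above):

```python
import pandas as pd

# Illustrative stand-in data with two repeated names
people = pd.DataFrame({
    'name': ['John', 'Nate', 'Edward', 'Nate', 'John'],
    'major': ['computer science', 'computer science', 'computer science', None, 'economics'],
})

# keep=False flags every row whose name appears more than once, not just the repeats
dup_mask = people.duplicated(subset=['name'], keep=False)
print(dup_mask.tolist())  # [True, True, False, True, True]

# Dropping with keep=False leaves only the names that occur exactly once
unique_only = people.drop_duplicates(subset=['name'], keep=False)
print(unique_only)
```

This is useful when it is unclear which of the conflicting rows is correct and the duplicates should be reviewed instead of silently resolved.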
Finding and Replacing NaN Values in a DataFrame
In [44]:
import pandas as pd
school_id_list =[
{'name':'John','job':'teacher','age':40},
{'name':'Nate','job':'teacher','age':35},
{'name':'Yuna','job':'teacher','age':37},
{'name':'Abraham','job':'student','age':10},
{'name':'Brian','job':'student','age':12},
{'name':'Janny','job':'student','age':11},
{'name':'Nate','job':'teacher','age':None},
{'name':'John','job':'student','age':None}
]
df = pd.DataFrame(school_id_list,columns = ['name','job','age'])
df
Out[44]:
| | name | job | age |
---|---|---|---|
0 | John | teacher | 40.0 |
1 | Nate | teacher | 35.0 |
2 | Yuna | teacher | 37.0 |
3 | Abraham | student | 10.0 |
4 | Brian | student | 12.0 |
5 | Janny | student | 11.0 |
6 | Nate | teacher | NaN |
7 | John | student | NaN |
In [46]:
df.shape  # 8 rows, 3 columns
Out[46]:
(8, 3)
In [47]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    8 non-null      object
 1   job     8 non-null      object
 2   age     6 non-null      float64
dtypes: float64(1), object(2)
memory usage: 320.0+ bytes
In [48]:
df.isna()
Out[48]:
| | name | job | age |
---|---|---|---|
0 | False | False | False |
1 | False | False | False |
2 | False | False | False |
3 | False | False | False |
4 | False | False | False |
5 | False | False | False |
6 | False | False | True |
7 | False | False | True |
In [51]:
df.isnull()  # alias of isna(); returns the same result
Out[51]:
| | name | job | age |
---|---|---|---|
0 | False | False | False |
1 | False | False | False |
2 | False | False | False |
3 | False | False | False |
4 | False | False | False |
5 | False | False | False |
6 | False | False | True |
7 | False | False | True |
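The boolean table gets hard to scan on larger data. A minimal sketch, using illustrative stand-in data (the `staff` frame is an assumption), of summarizing the mask per column and selecting only the affected rows:

```python
import pandas as pd

# Illustrative stand-in data with two missing ages
staff = pd.DataFrame({
    'name': ['John', 'Nate', 'Yuna', 'Nate'],
    'job': ['teacher', 'teacher', 'teacher', 'teacher'],
    'age': [40, 35, None, None],
})

# True counts as 1 when summed, so this gives the number of NaN values per column
missing_per_column = staff.isna().sum()
print(missing_per_column)

# Boolean indexing with the mask keeps only the rows where age is missing
rows_with_missing = staff[staff['age'].isna()]
print(rows_with_missing)
```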
In [52]:
# Replace NaN with 0
df.age = df.age.fillna(0)
df
Out[52]:
| | name | job | age |
---|---|---|---|
0 | John | teacher | 40.0 |
1 | Nate | teacher | 35.0 |
2 | Yuna | teacher | 37.0 |
3 | Abraham | student | 10.0 |
4 | Brian | student | 12.0 |
5 | Janny | student | 11.0 |
6 | Nate | teacher | 0.0 |
7 | John | student | 0.0 |
In [55]:
school_id_list =[
{'name':'John','job':'teacher','age':40},
{'name':'Nate','job':'teacher','age':35},
{'name':'Yuna','job':'teacher','age':37},
{'name':'Abraham','job':'student','age':10},
{'name':'Brian','job':'student','age':12},
{'name':'Janny','job':'student','age':11},
{'name':'Nate','job':'teacher','age':None},
{'name':'John','job':'student','age':None}
]
df = pd.DataFrame(school_id_list,columns = ['name','job','age'])
df
Out[55]:
| | name | job | age |
---|---|---|---|
0 | John | teacher | 40.0 |
1 | Nate | teacher | 35.0 |
2 | Yuna | teacher | 37.0 |
3 | Abraham | student | 10.0 |
4 | Brian | student | 12.0 |
5 | Janny | student | 11.0 |
6 | Nate | teacher | NaN |
7 | John | student | NaN |
In [57]:
# Replace NaN with the median age of each job group (recommended)
df['age'] = df['age'].fillna(df.groupby('job')['age'].transform('median'))
df
Out[57]:
| | name | job | age |
---|---|---|---|
0 | John | teacher | 40.0 |
1 | Nate | teacher | 35.0 |
2 | Yuna | teacher | 37.0 |
3 | Abraham | student | 10.0 |
4 | Brian | student | 12.0 |
5 | Janny | student | 11.0 |
6 | Nate | teacher | 37.0 |
7 | John | student | 11.0 |
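To see where 37.0 and 11.0 come from, it helps to look at the intermediate result of `transform('median')` on its own. The sketch below rebuilds the same data under different variable names (`school` and `job_median` are illustrative names) and fills the gaps by assigning the column back.

```python
import pandas as pd

# Same values as school_id_list above, written column-wise
school = pd.DataFrame({
    'name': ['John', 'Nate', 'Yuna', 'Abraham', 'Brian', 'Janny', 'Nate', 'John'],
    'job': ['teacher', 'teacher', 'teacher', 'student', 'student', 'student', 'teacher', 'student'],
    'age': [40, 35, 37, 10, 12, 11, None, None],
})

# transform('median') returns one value per row, aligned with the original index:
# teachers (40, 35, 37) -> 37.0, students (10, 12, 11) -> 11.0; NaN is skipped
job_median = school.groupby('job')['age'].transform('median')
print(job_median.tolist())

# fillna with an aligned Series only touches the rows that are NaN
school['age'] = school['age'].fillna(job_median)
print(school)
```

Filling by group keeps a missing student age from being pulled toward the teachers' ages, which is what a single overall median or a constant 0 would do.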