Pandas Data Analysis Basics Practice-7
2021. 9. 22. 17:22 · Big Data Study
Various ways to use the pandas apply function
In [1]:
import pandas as pd
date_list = [{'yyyy-mm-dd':'2000-06-27'},
{'yyyy-mm-dd':'2002-09-24'},
{'yyyy-mm-dd':'2005-12-20'}]
df = pd.DataFrame(date_list,columns=['yyyy-mm-dd'])
df
Out[1]:
|   | yyyy-mm-dd |
|---|---|
| 0 | 2000-06-27 |
| 1 | 2002-09-24 |
| 2 | 2005-12-20 |
In [2]:
# Add a column (year)
def extract_year(column):
    return column.split('-')[0]
In [3]:
df['year'] = df['yyyy-mm-dd'].apply(extract_year)
df
Out[3]:
|   | yyyy-mm-dd | year |
|---|---|---|
| 0 | 2000-06-27 | 2000 |
| 1 | 2002-09-24 | 2002 |
| 2 | 2005-12-20 | 2005 |
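For comparison, the same year extraction can be written without a custom function. This is only a sketch of two alternatives, not the approach used in this post: the `.str` accessor keeps the year as a string, while `pd.to_datetime` followed by `.dt.year` gives an integer. The column names `year_str` and `year_int` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'yyyy-mm-dd': ['2000-06-27', '2002-09-24', '2005-12-20']})

# Vectorized string split: the year stays a string, like extract_year returns
df['year_str'] = df['yyyy-mm-dd'].str.split('-').str[0]

# Parse the column as datetimes first, then read the year as an integer
df['year_int'] = pd.to_datetime(df['yyyy-mm-dd']).dt.year
```

The integer version avoids the later `int(year)` conversion, at the cost of parsing every value as a date.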
In [4]:
# Pass an extra parameter
def get_age(year, current_year):
    return current_year - int(year)
In [5]:
# Pass the extra parameter as a keyword argument
df['age'] = df['year'].apply(get_age, current_year=2018)
df
Out[5]:
|   | yyyy-mm-dd | year | age |
|---|---|---|---|
| 0 | 2000-06-27 | 2000 | 18 |
| 1 | 2002-09-24 | 2002 | 16 |
| 2 | 2005-12-20 | 2005 | 13 |
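Extra arguments can also be passed positionally. The sketch below rebuilds the `year` column as a standalone Series (the name `years` is an assumption for this example) and compares the keyword form with the `args` tuple that `Series.apply` accepts.

```python
import pandas as pd

def get_age(year, current_year):
    return current_year - int(year)

years = pd.Series(['2000', '2002', '2005'])

# Keyword form: current_year is forwarded to get_age by name
ages_kw = years.apply(get_age, current_year=2018)

# Positional form: extra arguments go in a tuple passed to args
ages_pos = years.apply(get_age, args=(2018,))
```

Both calls give the same ages. The keyword form is usually easier to read when a function takes several extra parameters.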
In [6]:
def get_introduce(age, prefix, suffix):
    return prefix + str(age) + suffix
In [7]:
df['introduce'] = df['age'].apply(get_introduce,prefix ='I am ',suffix = ' years old.')
df
Out[7]:
|   | yyyy-mm-dd | year | age | introduce |
|---|---|---|---|---|
| 0 | 2000-06-27 | 2000 | 18 | I am 18 years old. |
| 1 | 2002-09-24 | 2002 | 16 | I am 16 years old. |
| 2 | 2005-12-20 | 2005 | 13 | I am 13 years old. |
In [8]:
# Use values from several columns in the applied function
def get_introduce_2(row):
    return 'I was born in ' + str(row.year) + ' my age is ' + str(row.age)
In [9]:
# axis=1 passes each row to the function, so every column value in that row is available
df['introduce'] = df.apply(get_introduce_2, axis=1)  # same as df.introduce = ...
df
Out[9]:
|   | yyyy-mm-dd | year | age | introduce |
|---|---|---|---|---|
| 0 | 2000-06-27 | 2000 | 18 | I was born in 2000 my age is 18 |
| 1 | 2002-09-24 | 2002 | 16 | I was born in 2002 my age is 16 |
| 2 | 2005-12-20 | 2005 | 13 | I was born in 2005 my age is 13 |
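A row-wise `apply` calls the Python function once per row. For simple string building, the same result can usually be produced by concatenating whole columns. The sketch below compares the two on a small frame with the same `year` and `age` values. The names `by_row` and `by_col` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'year': ['2000', '2002', '2005'], 'age': [18, 16, 13]})

# Row-wise version: the function receives one row at a time
by_row = df.apply(
    lambda row: 'I was born in ' + str(row.year) + ' my age is ' + str(row.age),
    axis=1,
)

# Column-wise version: string concatenation is applied to entire columns at once
by_col = 'I was born in ' + df['year'].astype(str) + ' my age is ' + df['age'].astype(str)
```

On three rows the difference is invisible, but the column-wise form tends to scale better on large frames.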
Using the map and applymap functions
In [10]:
import pandas as pd
date_list = [{'date':'2000-06-27'},
{'date':'2002-09-24'},
{'date':'2005-12-20'}]
df = pd.DataFrame(date_list,columns=['date'])
df
Out[10]:
|   | date |
|---|---|
| 0 | 2000-06-27 |
| 1 | 2002-09-24 |
| 2 | 2005-12-20 |
In [11]:
def extract_year(date):
    return date.split('-')[0]
In [12]:
# Extract only the year from date to create the year column
df['year'] = df['date'].map(extract_year)
df
Out[12]:
|   | date | year |
|---|---|---|
| 0 | 2000-06-27 | 2000 |
| 1 | 2002-09-24 | 2002 |
| 2 | 2005-12-20 | 2005 |
In [13]:
job_list = [{'age':20,'job':'student'},
{'age':30,'job':'developer'},
{'age':30,'job':'teacher'}]
df = pd.DataFrame(job_list)
df
Out[13]:
|   | age | job |
|---|---|---|
| 0 | 20 | student |
| 1 | 30 | developer |
| 2 | 30 | teacher |
In [14]:
# Machine learning often requires converting strings to numbers
# Passing a dictionary to map() replaces each value with its mapped value
df.job = df.job.map({'student': 1, 'developer': 2, 'teacher': 3})
df
Out[14]:
|   | age | job |
|---|---|---|
| 0 | 20 | 1 |
| 1 | 30 | 2 |
| 2 | 30 | 3 |
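One thing to watch with dictionary mapping: any value that is not a key in the dictionary becomes NaN. The sketch below adds a hypothetical fourth job, 'banker', which is not in the mapping, to show the effect. Filling with -1 is just one possible fallback, chosen here for illustration.

```python
import pandas as pd

jobs = pd.Series(['student', 'developer', 'teacher', 'banker'])
codes = {'student': 1, 'developer': 2, 'teacher': 3}

# 'banker' is not a key in codes, so map() yields NaN for that row
mapped = jobs.map(codes)

# One possible fallback: replace NaN with a sentinel and restore the integer dtype
mapped_filled = mapped.fillna(-1).astype(int)
```

Because NaN is a float, the mapped column is float until the missing values are filled.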
In [15]:
x_y = [{'x':5.5,'y':-5.6,'z':1.1},
{'x':-5.2,'y':5.5,'z':-2.2},
{'x':-1.6,'y':-4.5,'z':-3.3}]
df =pd.DataFrame(x_y)
df
Out[15]:
|   | x | y | z |
|---|---|---|---|
| 0 | 5.5 | -5.6 | 1.1 |
| 1 | -5.2 | 5.5 | -2.2 |
| 2 | -1.6 | -4.5 | -3.3 |
In [16]:
import numpy as np
In [17]:
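# Round every element; applymap applies the function to each cell
# (pandas 2.1+ deprecates applymap in favor of DataFrame.map)
df = df.applymap(np.around)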
df
Out[17]:
|   | x | y | z |
|---|---|---|---|
| 0 | 6.0 | -6.0 | 1.0 |
| 1 | -5.0 | 6.0 | -2.0 |
| 2 | -2.0 | -4.0 | -3.0 |
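Note that -4.5 became -4.0 in the output, not -5.0. `np.around` rounds exact halves to the nearest even value. The sketch below shows that behavior on a few hand-picked values and, as an alternative not used in this post, the built-in `DataFrame.round()`, which gives the same result without an element-wise call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [5.5, -5.2, -1.6],
                   'y': [-5.6, 5.5, -4.5],
                   'z': [1.1, -2.2, -3.3]})

# DataFrame.round() rounds every numeric cell in one call
rounded = df.round()

# Exact halves go to the nearest even value: 0.5 -> 0.0, 1.5 -> 2.0, 2.5 -> 2.0, -4.5 -> -4.0
half_cases = np.around([0.5, 1.5, 2.5, -4.5])
```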
In [18]:
import pandas as pd
job_list = [{'name':'John','job':'teacher'},
{'name':'Nate','job':'teacher'},
{'name':'Fred','job':'teacher'},
{'name':'Abraham','job':'student'},
{'name':'Brian','job':'student'},
{'name':'Janny','job':'developer'},
{'name':'Nate','job':'teacher'},
{'name':'Ian','job':'teacher'},
{'name':'Chris','job':'banker'},
{'name':'Phillip','job':'lawyer'},
{'name':'Phillip','job':'basketball player'},
{'name':'Gwen','job':'teacher'},
{'name':'Jessy','job':'student'}]
df = pd.DataFrame(job_list,columns =['name','job'])
df
Out[18]:
|   | name | job |
|---|---|---|
| 0 | John | teacher |
| 1 | Nate | teacher |
| 2 | Fred | teacher |
| 3 | Abraham | student |
| 4 | Brian | student |
| 5 | Janny | developer |
| 6 | Nate | teacher |
| 7 | Ian | teacher |
| 8 | Chris | banker |
| 9 | Phillip | lawyer |
| 10 | Phillip | basketball player |
| 11 | Gwen | teacher |
| 12 | Jessy | student |
In [19]:
# Get the unique values
df.job.unique()
Out[19]:
array(['teacher', 'student', 'developer', 'banker', 'lawyer', 'basketball player'], dtype=object)
In [20]:
df.job.value_counts()  # count how many rows have each unique value
Out[20]:
teacher              6
student              3
developer            1
banker               1
lawyer               1
basketball player    1
Name: job, dtype: int64
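`value_counts` can also report proportions instead of raw counts. The sketch below rebuilds the job column as a standalone Series (the name `jobs` is an assumption for this example) and shows `normalize=True` next to `nunique()`, which returns only the number of distinct values.

```python
import pandas as pd

jobs = pd.Series(['teacher'] * 6 + ['student'] * 3 +
                 ['developer', 'banker', 'lawyer', 'basketball player'])

counts = jobs.value_counts()                # absolute counts, most frequent first
shares = jobs.value_counts(normalize=True)  # fraction of all rows per value
n_distinct = jobs.nunique()                 # number of distinct values
```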
In [21]:
l1 = [{'name': 'John', 'job': "teacher"},
{'name': 'Nate', 'job': "student"},
{'name': 'Fred', 'job': "developer"}]
l2 = [{'name': 'Ed', 'job': "dentist"},
{'name': 'Jack', 'job': "farmer"},
{'name': 'Ted', 'job': "designer"}]
df1 = pd.DataFrame(l1, columns = ['name', 'job'])
df2 = pd.DataFrame(l2, columns = ['name', 'job'])
In [22]:
df1
Out[22]:
|   | name | job |
|---|---|---|
| 0 | John | teacher |
| 1 | Nate | student |
| 2 | Fred | developer |
In [23]:
df2
Out[23]:
|   | name | job |
|---|---|---|
| 0 | Ed | dentist |
| 1 | Jack | farmer |
| 2 | Ted | designer |
In [24]:
# First method: pass the two DataFrames to pd.concat as a list
result = pd.concat([df1, df2], ignore_index=True)
result
Out[24]:
|   | name | job |
|---|---|---|
| 0 | John | teacher |
| 1 | Nate | student |
| 2 | Fred | developer |
| 3 | Ed | dentist |
| 4 | Jack | farmer |
| 5 | Ted | designer |
In [25]:
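# Second method: DataFrame.append
# (deprecated in pandas 1.4 and removed in 2.0; on newer versions use pd.concat instead)
result = df1.append(df2, ignore_index=True)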
result
Out[25]:
|   | name | job |
|---|---|---|
| 0 | John | teacher |
| 1 | Nate | student |
| 2 | Fred | developer |
| 3 | Ed | dentist |
| 4 | Jack | farmer |
| 5 | Ted | designer |
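To see what `ignore_index=True` changes, the sketch below concatenates the same two frames with and without it. The variable names `kept` and `reset` are made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['John', 'Nate', 'Fred'],
                    'job': ['teacher', 'student', 'developer']})
df2 = pd.DataFrame({'name': ['Ed', 'Jack', 'Ted'],
                    'job': ['dentist', 'farmer', 'designer']})

# Default: each frame keeps its own index labels, so 0, 1, 2 appear twice
kept = pd.concat([df1, df2])

# ignore_index=True renumbers the rows 0..5
reset = pd.concat([df1, df2], ignore_index=True)
```

Duplicate index labels are legal, but a lookup such as `kept.loc[0]` then returns two rows, which is easy to miss.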
Adding the second DataFrame to the first DataFrame as new columns
In [26]:
l3 = [{'name': 'John', 'job': "teacher"},
{'name': 'Nate', 'job': "student"},
{'name': 'Jack', 'job': "developer"}]
l4 = [{'age': 25, 'country': "U.S"},
{'age': 30, 'country': "U.K"},
{'age': 45, 'country': "Korea"}]
df1 = pd.DataFrame(l3,columns= ['name','job'])
df2 = pd.DataFrame(l4,columns=['age','country'])
In [27]:
df1
Out[27]:
|   | name | job |
|---|---|---|
| 0 | John | teacher |
| 1 | Nate | student |
| 2 | Jack | developer |
In [28]:
df2
Out[28]:
|   | age | country |
|---|---|---|
| 0 | 25 | U.S |
| 1 | 30 | U.K |
| 2 | 45 | Korea |
In [33]:
result = pd.concat([df1,df2],axis = 1,ignore_index= True)
result.columns = ['name','job','age','country']
result
Out[33]:
|   | name | job | age | country |
|---|---|---|---|---|
| 0 | John | teacher | 25 | U.S |
| 1 | Nate | student | 30 | U.K |
| 2 | Jack | developer | 45 | Korea |
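With `axis=1`, `ignore_index=True` applies to the column labels rather than the row index, which is why the columns had to be renamed afterwards. The sketch below compares both calls on the same data. The names `renumbered` and `named` are made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['John', 'Nate', 'Jack'],
                    'job': ['teacher', 'student', 'developer']})
df2 = pd.DataFrame({'age': [25, 30, 45],
                    'country': ['U.S', 'U.K', 'Korea']})

# With axis=1, ignore_index=True replaces the column labels with 0, 1, 2, 3
renumbered = pd.concat([df1, df2], axis=1, ignore_index=True)

# Without it, the original column names are kept and no renaming is needed
named = pd.concat([df1, df2], axis=1)
```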
In [34]:
label = [1,2,3,4,5]
prediction=[1,2,2,4,4]
In [36]:
comparison =pd.DataFrame({'label':label,'prediction' : prediction})
comparison
Out[36]:
|   | label | prediction |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 2 | 2 |
| 2 | 3 | 2 |
| 3 | 4 | 4 |
| 4 | 5 | 4 |
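One likely use of a label/prediction frame like this is checking how often the two columns agree. The post does not take this step, so the sketch below is only an assumption about where the example could go next. The `correct` column and the `accuracy` variable are names introduced for this illustration.

```python
import pandas as pd

comparison = pd.DataFrame({'label': [1, 2, 3, 4, 5],
                           'prediction': [1, 2, 2, 4, 4]})

# Boolean column: True where the prediction equals the label
comparison['correct'] = comparison['label'] == comparison['prediction']

# The mean of a boolean column is the fraction of True values
accuracy = comparison['correct'].mean()
```

Here three of the five predictions match, so the accuracy is 0.6.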