2023-06-16

Python / preprocessing

6 min read (about 873 words)

Py) Preprocessing - DataFrame Data Types

Checking the data type of each variable (column) in a DataFrame is an important preprocessing step. But how can it be done when the number of variables is very large? Let's take a look.

A simple DataFrame

First, let's prepare a simple DataFrame.

import pandas as pd
df = pd.DataFrame([["A", 1], ["C", 2], ["B", 3]],
                  columns = ["Rank", "Score"])
df

	Rank	Score
0	A	1
1	C	2
2	B	3

A DataFrame's .info() method shows details for each column, including each column's dtype.

df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 3 entries, 0 to 2
## Data columns (total 2 columns):
##  #   Column  Non-Null Count  Dtype 
## ---  ------  --------------  ----- 
##  0   Rank    3 non-null      object
##  1   Score   3 non-null      int64 
## dtypes: int64(1), object(1)
## memory usage: 176.0+ bytes
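
When only the dtypes are needed, the .dtypes attribute returns them as a Series indexed by column name. The following is a minimal sketch that rebuilds the three-row frame from the output above; the variable name dtypes is only illustrative.

```python
import pandas as pd

# Same small frame as in the output above
df = pd.DataFrame([["A", 1], ["C", 2], ["B", 3]],
                  columns=["Rank", "Score"])

# .dtypes is an attribute (no parentheses); it holds one dtype per column
dtypes = df.dtypes
print(dtypes)

# A single column's dtype can be looked up by column name
print(dtypes["Score"])
```

The dtype reported for the text column may depend on the pandas version, so the sketch prints the result rather than assuming it.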

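A DataFrame with very many variables

But what if there are 1,000 or more variables? Let's build an arbitrary DataFrame by random sampling.
※ No seed is fixed, so the generated values will differ from run to run even with identical code.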

df_col1 = pd.DataFrame()
df_col2 = pd.DataFrame()
for n in range(200):
    df_s1 = pd.DataFrame(["A", "B", 1, 2, 3]).sample(n = 5, replace = True)
    df_s2 = pd.DataFrame(["A", "B", 1, 2, 3]).sample(n = 5, replace = True)
    df_col1 = pd.concat([df_col1, df_s1])
    df_col2 = pd.concat([df_col2, df_s2])

df_col = pd.concat([df_col1.reset_index(drop = True), 
                    df_col2.reset_index(drop = True)],
                   axis = 1)
df_col.head()

	0	0
0	A	2
1	2	2
2	1	2
3	A	B
4	A	1
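
If reproducible values are wanted, one option is to pass random_state to .sample(). The sketch below is not the code used in this post: the helper make_tall and the seed values are hypothetical, and it draws all 1,000 values per column in one call instead of concatenating inside a loop.

```python
import pandas as pd

values = pd.Series(["A", "B", 1, 2, 3])

def make_tall(seed):
    # Draw 1,000 values per column with replacement, using a fixed seed
    col1 = values.sample(n=1000, replace=True, random_state=seed)
    col2 = values.sample(n=1000, replace=True, random_state=seed + 1)
    return pd.concat([col1.reset_index(drop=True),
                      col2.reset_index(drop=True)], axis=1)

df_a = make_tall(seed=0)
df_b = make_tall(seed=0)

print(df_a.shape)  # (1000, 2)
# The same seed gives the same values on every call
print(df_a.values.tolist() == df_b.values.tolist())
```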

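To turn this into a DataFrame with many variables, the .transpose() method was used. Because writing separate code to convert the data types was a hassle, the frame was saved to a temporary file and read back in, so that pd.read_csv() would infer the type of each column again.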

df_col_t = df_col.transpose().reset_index(drop = True)
df_col_t.to_csv("test.csv", index = False)
df_col_t = pd.read_csv("test.csv")
df_col_t.head()

	0	1	2	3	4	5	6	7	...	992	993	994	995	996	997	998	999
0	A	2	1	A	A	A	3	B	...	A	1	3	2	3	3	2	2
1	2	2	2	B	1	2	2	3	...	1	2	1	1	1	3	B	1
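
As an alternative to the CSV round trip, .infer_objects() can re-infer dtypes in memory. This is a sketch on a small stand-in frame (the names tall, wide and converted are illustrative), not the approach taken in this post.

```python
import pandas as pd

# Small stand-in for the sampled frame: both columns mix text and integers
tall = pd.DataFrame({0: ["A", 2, 1, "B"], 1: [2, 2, 3, "A"]})

# After transposing, every column has dtype object
wide = tall.transpose().reset_index(drop=True)
print(wide.dtypes.tolist())

# infer_objects() converts object columns that hold only integers to an integer dtype
converted = wide.infer_objects()
print(converted.dtypes.tolist())
```

One difference from the CSV round trip: a mixed column keeps its original Python values here, whereas read_csv would read every value in such a column as text.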

Checking the data types shows 386 integer variables and 614 object variables. The drawback is that this summary does not reveal which variables are integers and which are not.

df_col_t.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2 entries, 0 to 1
## Columns: 1000 entries, 0 to 999
## dtypes: int64(386), object(614)
## memory usage: 15.8+ KB

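Going further: variables of a specific data type

Using the .apply() method with a lambda function, the data type of each variable can be extracted and stored in the “df_col_t_dtype” object, which looks like this: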

df_col_t_dtype = df_col_t.apply(lambda x: x.dtype)
df_col_t_dtype.head(3)
## 0    object
## 1     int64
## 2     int64
## dtype: object
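
The same per-column dtypes are also available through the .dtypes attribute, and they can be grouped to see which columns belong to each type. A minimal sketch on an illustrative three-column frame (cols_by_dtype is a hypothetical name):

```python
import pandas as pd

# Illustrative frame: one text column and two integer columns
df = pd.DataFrame({"0": ["A", "B"], "1": [2, 2], "2": [1, 2]})

# df.dtypes gives the same per-column dtypes as df.apply(lambda x: x.dtype)
dtypes = df.dtypes

# Collect column labels under the name of their dtype
cols_by_dtype = {}
for col, dt in dtypes.items():
    cols_by_dtype.setdefault(str(dt), []).append(col)

print(cols_by_dtype["int64"])  # ['1', '2']
```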

The following code uses “df_col_t_dtype” to keep only the integer variables.

col_int = df_col_t_dtype[df_col_t_dtype == "int64"].index
df_col_t_int = df_col_t.loc[:, col_int]
df_col_t_int.head()

	1	2	6	8	13	14	15	20	...	989	990	993	994	995	996	997	999
0	2	1	3	2	3	3	3	2	...	2	2	1	3	2	3	3	2
1	2	2	2	1	1	1	1	3	...	1	1	2	1	1	1	3	1

df_col_t_int.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2 entries, 0 to 1
## Columns: 386 entries, 1 to 999
## dtypes: int64(386)
## memory usage: 6.2 KB
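
Comparing against "int64" matches that exact dtype only. If integer columns of other widths (for example int32) may be present, pandas.api.types.is_integer_dtype is one way to catch them all. A sketch on a made-up frame, with illustrative column names:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_integer_dtype

df = pd.DataFrame({
    "a": np.array([1, 2], dtype="int64"),
    "b": np.array([3, 4], dtype="int32"),
    "c": ["x", "y"],
})

# Exact comparison keeps only the int64 column
only_int64 = df.dtypes[df.dtypes == "int64"].index.tolist()

# is_integer_dtype accepts any integer width
any_int = [col for col in df.columns if is_integer_dtype(df[col])]

print(only_int64)  # ['a']
print(any_int)     # ['a', 'b']
```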

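The .select_dtypes() method makes the code above much more concise. Its main arguments are “include” and “exclude”; passing data types to them includes or excludes variables of those types. As in the following code, passing “number” to “include” selects all numeric variables at once. Note that the output below reports 356 numeric columns rather than the 386 found earlier; because no seed was fixed, it appears to come from a separate run with a different random sample.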

df_col_t_num = df_col_t.select_dtypes(include = "number")
df_col_t_num.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2 entries, 0 to 1
## Columns: 356 entries, 1 to 999
## dtypes: int64(356)
## memory usage: 5.7 KB

The following code selects the remaining variables, that is, everything except the numeric ones (644 object columns in this run).

df_col_t_cat = df_col_t.select_dtypes(exclude = "number")
df_col_t_cat.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2 entries, 0 to 1
## Columns: 644 entries, 0 to 998
## dtypes: object(644)
## memory usage: 10.2+ KB
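
Both arguments also accept a specific dtype name or a list of types. The sketch below uses a made-up four-column frame to show how “number” differs from a single dtype such as “int64”, and that boolean columns are not counted as numeric.

```python
import pandas as pd

df = pd.DataFrame({
    "i": [1, 2],
    "f": [0.5, 1.5],
    "s": ["x", "y"],
    "b": [True, False],
})

# "number" covers integer and float columns, but not bool
num_cols = df.select_dtypes(include="number").columns.tolist()
# A specific dtype name selects only that type
int_cols = df.select_dtypes(include="int64").columns.tolist()
# exclude keeps everything that is not numeric
rest_cols = df.select_dtypes(exclude="number").columns.tolist()
# A list combines several types
num_or_bool = df.select_dtypes(include=["number", "bool"]).columns.tolist()

print(num_cols)     # ['i', 'f']
print(int_cols)     # ['i']
print(rest_cols)    # ['s', 'b']
print(num_or_bool)  # ['i', 'f', 'b']
```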

# dtype, matching, preprocessing, python, data type
