[Data Analysis] 03. Data Visualization

도개진 2023. 1. 2. 12:28

!conda install -y matplotlib

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/miniconda3

  added / updated specs:
    - matplotlib


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    brotli-1.0.9               |       h5eee18b_7          18 KB
    brotli-bin-1.0.9           |       h5eee18b_7          19 KB
    cycler-0.11.0              |     pyhd3eb1b0_0          12 KB
    fonttools-4.25.0           |     pyhd3eb1b0_0         632 KB
    freetype-2.12.1            |       h4a9f257_0         626 KB
    kiwisolver-1.4.2           |   py39h295c915_0          83 KB
    lcms2-2.12                 |       h3be6417_0         312 KB
    libbrotlicommon-1.0.9      |       h5eee18b_7          70 KB
    libbrotlidec-1.0.9         |       h5eee18b_7          31 KB
    libbrotlienc-1.0.9         |       h5eee18b_7         264 KB
    matplotlib-3.5.3           |   py39h06a4308_0           7 KB
    matplotlib-base-3.5.3      |   py39hf590b9c_0         6.4 MB
    munkres-1.1.4              |             py_0          13 KB
    pillow-9.2.0               |   py39hace64e9_1         670 KB
    ------------------------------------------------------------
                                           Total:         9.1 MB

The following NEW packages will be INSTALLED:

  brotli             pkgs/main/linux-64::brotli-1.0.9-h5eee18b_7 None
  brotli-bin         pkgs/main/linux-64::brotli-bin-1.0.9-h5eee18b_7 None
  cycler             pkgs/main/noarch::cycler-0.11.0-pyhd3eb1b0_0 None
  fonttools          pkgs/main/noarch::fonttools-4.25.0-pyhd3eb1b0_0 None
  kiwisolver         pkgs/main/linux-64::kiwisolver-1.4.2-py39h295c915_0 None
  lcms2              pkgs/main/linux-64::lcms2-2.12-h3be6417_0 None
  libbrotlicommon    pkgs/main/linux-64::libbrotlicommon-1.0.9-h5eee18b_7 None
  libbrotlidec       pkgs/main/linux-64::libbrotlidec-1.0.9-h5eee18b_7 None
  libbrotlienc       pkgs/main/linux-64::libbrotlienc-1.0.9-h5eee18b_7 None
  matplotlib         pkgs/main/linux-64::matplotlib-3.5.3-py39h06a4308_0 None
  matplotlib-base    pkgs/main/linux-64::matplotlib-base-3.5.3-py39hf590b9c_0 None
  munkres            pkgs/main/noarch::munkres-1.1.4-py_0 None
  pillow             pkgs/main/linux-64::pillow-9.2.0-py39hace64e9_1 None

The following packages will be UPDATED:

  freetype           conda-forge::freetype-2.10.4-h0708190~ --> pkgs/main::freetype-2.12.1-h4a9f257_0 None



Downloading and Extracting Packages
freetype-2.12.1      | 626 KB    | ##################################### | 100% 
brotli-bin-1.0.9     | 19 KB     | ##################################### | 100% 
libbrotlicommon-1.0. | 70 KB     | ##################################### | 100% 
kiwisolver-1.4.2     | 83 KB     | ##################################### | 100% 
cycler-0.11.0        | 12 KB     | ##################################### | 100% 
fonttools-4.25.0     | 632 KB    | ##################################### | 100% 
brotli-1.0.9         | 18 KB     | ##################################### | 100% 
matplotlib-3.5.3     | 7 KB      | ##################################### | 100% 
matplotlib-base-3.5. | 6.4 MB    | ##################################### | 100% 
munkres-1.1.4        | 13 KB     | ##################################### | 100% 
pillow-9.2.0         | 670 KB    | ##################################### | 100% 
libbrotlidec-1.0.9   | 31 KB     | ##################################### | 100% 
libbrotlienc-1.0.9   | 264 KB    | ##################################### | 100% 
lcms2-2.12           | 312 KB    | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Retrieving notices: ...working... done

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

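Frequency distribution table

Data that is just a bare list of numbers conveys little information on its own.
- What can we learn from a dataset recording the heights of 80 students?
The students' heights are not all the same; each appears as a different number.
When values are spread over a range like this, we say they are distributed.
A distribution arises because some uncertainty is at work.
That uncertainty is regarded as what produces the varying height values.
- Even so, these values have characteristic features and recurring patterns.
- To draw out the properties of a distribution we use statistics, a tool grounded in probability.
To visualize distributed data we draw a histogram.
- First, build a frequency distribution table:
  - Find the maximum and minimum values
  - Set the classes (the intervals)
  - Set the class mark of each class
  - Count the data points in each class (the frequency)
  - Compute each class's share of the total (the relative frequency)
  - Compute the running total of the class frequencies (the cumulative frequency)
A histogram shows the distribution of the data, that is, how the values cluster.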

Visualizing continuous data 1

Histogram

studs_h = pd.read_csv('../data/height.csv')
studs_h.head()

	height
0	151
1	154
2	160
3	160
4	163

# Find the range of the data for the x-axis
# Take the scalar max/min of the height column (this also avoids shadowing the built-ins max and min)
hmax = studs_h.height.max()
hmin = studs_h.height.min()
print(hmax, hmin, hmax - hmin)

169 143 26

# Set the range covered by the bins
# x-axis minimum: 143 -> 140
# x-axis maximum: 169 -> 170
bmax = int(hmax / 10) * 10 + 10
bmin = int(hmin / 10) * 10
print(bmin, bmax)

140 170
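The rounding above works for this data, but it adds a whole extra block of 10 when the maximum is already a multiple of 10 (a maximum of 170 would give 180), and it does not take the bin width into account. A minimal sketch of a more general approach follows. `bin_edges` is a hypothetical helper, not part of the original notebook, and the salary-like numbers in the last call are made up for illustration.

```python
import numpy as np

def bin_edges(values, width):
    """Return edges spaced `width` apart that cover every value (hypothetical helper)."""
    values = np.asarray(values)
    lo = values.min() // width * width        # round the minimum down to a multiple of width
    hi = -(-values.max() // width) * width    # round the maximum up to a multiple of width
    n = int(round((hi - lo) / width))         # number of bins between lo and hi
    if n == 0:                                # all values equal and on an edge: keep one bin
        n = 1
    return lo + width * np.arange(n + 1)

print(bin_edges([143, 169], 5))       # same edges as the notebook: 140 ... 170
print(bin_edges([150, 170], 5))       # stops at 170 instead of running on to 180
print(bin_edges([2100, 24000], 2500)) # made-up values: edges still reach past the maximum
```

Because both ends are rounded to multiples of the bin width, the last edge can never fall short of the largest value.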

# Build the bin edges
# Use numpy's arange function
# arange(start, stop + 1, step): stop is exclusive, so add 1 to include the maximum edge
bins = np.arange(bmin, bmax + 1, 5)
print(bins)

[140 145 150 155 160 165 170]

# Frequency within each bin
# Use numpy's histogram function, which returns (counts, edges)
hist, bins = np.histogram(studs_h.height, bins)
print(hist)

[ 1  4 17 27 23  8]
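It helps to know which class a value lying exactly on an edge is counted in. In `np.histogram`, every bin includes its left edge and excludes its right edge, except the last bin, which includes both. The small sketch below uses made-up heights placed on the edges to show this.

```python
import numpy as np

edges = np.array([140, 145, 150])
sample = np.array([140, 144, 145, 150])   # made-up heights chosen to sit on the edges

counts, _ = np.histogram(sample, edges)
print(counts)   # [2 2]: 145 goes to the second bin, and 150 is kept by the closed last bin
```

So a label such as '155 ~ 160' in the table means 155 <= height < 160, while the final class also contains its upper bound.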

# Class mark: the value representing each class, the midpoint of its interval
mid = (bins[1] - bins[0]) / 2
mdbins = bins[:-1] + mid
print(mid, mdbins)

2.5 [142.5 147.5 152.5 157.5 162.5 167.5]

# Relative frequency
# Each class frequency divided by the total number of observations
total = len(studs_h.height)
relfrq = hist / total
print(relfrq, sum(relfrq))

[0.0125 0.05   0.2125 0.3375 0.2875 0.1   ] 1.0
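Relative frequency is closely related to the density scale used by KDE curves: dividing each relative frequency by the class width gives bar heights whose areas sum to 1. The sketch below checks this against `np.histogram(..., density=True)`. Since the raw heights are not reproduced here, it uses stand-in data built by repeating each class mark by its frequency, which is an assumption made purely for illustration.

```python
import numpy as np

hist = np.array([1, 4, 17, 27, 23, 8])       # class frequencies from the output above
edges = np.arange(140, 171, 5)               # the same bin edges
width = 5

relfrq = hist / hist.sum()                   # bar heights sum to 1
density = relfrq / width                     # bar areas (height * width) sum to 1

# Stand-in data: each class mark repeated by its frequency
stand_in = np.repeat(edges[:-1] + width / 2, hist)
np_density, _ = np.histogram(stand_in, edges, density=True)

print(np.allclose(density, np_density))      # True
```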

# Cumulative frequency
# Running total of the class frequencies
# e.g. [1, 2, 3, 4, 5] => running total [1, 3, 6, 10, 15]
# Use numpy's cumsum function
print(np.cumsum(hist))

[ 1  5 22 49 72 80]
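One use of the cumulative frequency is locating the class that contains the median. A minimal sketch, reusing the frequencies printed above: divide the cumulative counts by the total and find the first class whose cumulative share reaches 50%.

```python
import numpy as np

hist = np.array([1, 4, 17, 27, 23, 8])   # class frequencies from above
edges = np.arange(140, 171, 5)

cum = np.cumsum(hist)                    # [ 1  5 22 49 72 80]
cum_rel = cum / cum[-1]                  # cumulative relative frequency, ends at 1.0
k = np.searchsorted(cum_rel, 0.5)        # index of the first class reaching 50%

print(f'{edges[k]} ~ {edges[k + 1]}')    # 155 ~ 160
```

With 80 students, the 40th value falls in the class whose cumulative count first reaches 40 or more, which is 155 ~ 160 (cumulative count 49).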

[f'{i} ~ {i + 5}' for i in np.arange(bmin, bmax, 5)]

['140 ~ 145', '145 ~ 150', '150 ~ 155', '155 ~ 160', '160 ~ 165', '165 ~ 170']

# Collect the results so far in one table
# Add a new column: df['new_column'] = list_like
frqclass = [f'{i} ~ {i + 5}' for i in np.arange(bmin, bmax, 5)]
stdhist = pd.DataFrame({'frq':hist}, index=pd.Index(frqclass, name='class'))
stdhist['midbin'] = mdbins
stdhist['relfrq'] = relfrq
stdhist['csfrq'] = np.cumsum(hist)
stdhist

	frq	midbin	relfrq	csfrq
class
140 ~ 145	1	142.5	0.0125	1
145 ~ 150	4	147.5	0.0500	5
150 ~ 155	17	152.5	0.2125	22
155 ~ 160	27	157.5	0.3375	49
160 ~ 165	23	162.5	0.2875	72
165 ~ 170	8	167.5	0.1000	80
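The same steps are repeated for every dataset in this post, so they can be wrapped in a function. The sketch below is one possible way to do that. `freq_table` is a hypothetical helper that is not in the original notebook, and it is tried here on the Joseon kings' lifespans used further down.

```python
import numpy as np
import pandas as pd

def freq_table(values, edges):
    """Build a frequency table for `values` from the given bin edges (hypothetical helper)."""
    edges = np.asarray(edges)
    frq, _ = np.histogram(values, edges)
    labels = [f'{lo} ~ {hi}' for lo, hi in zip(edges[:-1], edges[1:])]
    table = pd.DataFrame({'frq': frq}, index=pd.Index(labels, name='class'))
    table['midbin'] = (edges[:-1] + edges[1:]) / 2   # class mark
    table['relfrq'] = frq / len(values)              # relative frequency
    table['csfrq'] = np.cumsum(frq)                  # cumulative frequency
    return table

kings = [73,62,45,53,38,16,51,28,37,30,56,30,33,56,
         66,54,40,33,59,36,82,48,44,22,32,67,52]
print(freq_table(kings, np.arange(10, 91, 10)))
```

The edges are passed in explicitly so that the choice of range and width stays visible at the call site.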

Plotting the histogram

hist(data, bins, options)

plt.hist(studs_h.height, bins, color='red')
plt.grid()

Frequency polygon

plot(x values, y values, options)

plt.plot(mdbins, hist)
plt.grid()
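A frequency polygon is often closed by adding an empty class on each side, so that the line starts and ends at zero. The sketch below only prepares the points, using the class marks and frequencies computed earlier; passing them to `plt.plot` is left as a comment.

```python
import numpy as np

hist = np.array([1, 4, 17, 27, 23, 8])
mdbins = np.array([142.5, 147.5, 152.5, 157.5, 162.5, 167.5])
width = 5

# One extra class mark before the first class and one after the last
x = np.concatenate(([mdbins[0] - width], mdbins, [mdbins[-1] + width]))
y = np.pad(hist, 1)      # np.pad fills with zeros by default

print(x)
print(y)
# plt.plot(x, y) would then draw the closed polygon
```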

Plotting a kernel density estimate with pandas

series.plot(kind='kde')

studs_h.height.plot(kind='kde')
plt.grid()

!conda install -y seaborn

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Retrieving notices: ...working... done

Plotting a kernel density estimate with seaborn

sns.histplot(data, kde=True)

import seaborn as sns

sns.histplot(studs_h.height, kde=True)
plt.grid()

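Kernel density estimation (KDE)

KDE stands for kernel density estimation.
It estimates the probability distribution of the underlying variable from the distribution of the observed data.
In other words, from a limited number of observations of a variable, it estimates the density (probability) over every value the variable could take.
- Example: students take mock exams to prepare for the college entrance exam (수능)
- The mock exams can be used to predict the actual exam score
- Based on the distribution of scores over several mock exams,
- one can infer the probability of the score on the real exam
seaborn's histplot (with kde=True) draws it easily. distplot, which older tutorials use, has been deprecated since seaborn 0.11.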
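To make the idea concrete, here is a minimal sketch of a Gaussian KDE written with numpy only: a small bell curve is placed on each observation and the curves are averaged, f(x) = 1/(n·h) · Σ K((x − xᵢ)/h). The function name, the mock-exam scores, and the hand-picked bandwidth are all assumptions for illustration. This is not the code seaborn or pandas run internally, and those libraries choose a bandwidth automatically.

```python
import numpy as np

def gaussian_kde_sketch(sample, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on `grid` (illustrative only)."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One row per grid point, one column per observation
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average the bumps and rescale by the bandwidth so the curve integrates to 1
    return kernels.sum(axis=1) / (len(sample) * bandwidth)

scores = [62, 70, 71, 75, 78, 80, 84, 91]     # made-up mock-exam scores
grid = np.linspace(0, 160, 1601)
density = gaussian_kde_sketch(scores, grid, bandwidth=5.0)   # bandwidth picked by hand

step = grid[1] - grid[0]
print((density * step).sum())                 # area under the curve, close to 1
```

A larger bandwidth gives a smoother, flatter curve, while a smaller one follows the individual observations more closely.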

# Visualizing the lifespans of the Joseon dynasty kings
kings = [73,62,45,53,38,16,51,28,37,30,56,30,33,56,
         66,54,40,33,59,36,82,48,44,22,32,67,52,]

kings_df = pd.DataFrame({'age': kings})
# Take scalar max/min from the column; newer pandas versions warn when int() is called on a one-element Series
k_max = kings_df.age.max()
k_min = kings_df.age.min()
kb_max = int(k_max / 10) * 10 + 10
kb_min = int(k_min / 10) * 10

bins = np.arange(kb_min, kb_max + 1, 10)

hist, bins = np.histogram(kings_df.age, bins)
frqclass = [f'{i} ~ {i + 10}' for i in np.arange(kb_min, kb_max, 10)]

# The column holds class frequencies, so it is named 'frq' rather than 'age'
df = pd.DataFrame({'frq': hist}, index=pd.Index(frqclass, name='class'))
mid = (bins[1] - bins[0]) / 2
mdbins = bins[:-1] + mid
total = len(kings_df.age)
relfrq = hist / total
df['mdbins'] = mdbins
df['relfrq'] = relfrq
df['csfrq'] = np.cumsum(hist)

df

	frq	mdbins	relfrq	csfrq
class
10 ~ 20	1	15.0	0.037037	1
20 ~ 30	2	25.0	0.074074	3
30 ~ 40	8	35.0	0.296296	11
40 ~ 50	4	45.0	0.148148	15
50 ~ 60	7	55.0	0.259259	22
60 ~ 70	3	65.0	0.111111	25
70 ~ 80	1	75.0	0.037037	26
80 ~ 90	1	85.0	0.037037	27

plt.hist(kings, bins, color='orange')
plt.grid()

plt.plot(mdbins, hist)

[<matplotlib.lines.Line2D at 0x7f0ef52498b0>]

sns.histplot(kings, bins=bins, kde=True)

<AxesSubplot:ylabel='Count'>

Visualizing teenagers' mobile phone usage time

Build a frequency table
Draw a histogram and KDE

phone = [10,37,22,32,18,15,15,18,22,15,20,25,38,28,
         25,30,20,22,18,22,22,12,22,26,22,32,22,23,
         20,23,23,20,25,51,20,25,26,22,26,28,28,20,
         23,30,12,22,35,11,20,25]
pmax = np.max(phone)   # renamed so the built-ins max and min are not shadowed
pmin = np.min(phone)
bmax = int(pmax / 10) * 10 + 10
bmin = int(pmin / 10) * 10
bins = np.arange(bmin, bmax + 1, 5)

hist, bins = np.histogram(phone, bins)

plt.hist(phone, bins, color='orange')
plt.grid()

sns.histplot(phone, bins=bins, kde=True)

<AxesSubplot:ylabel='Count'>

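Visualizing the distribution of employee salaries

emps = pd.read_csv('../data/employees.csv')
emps = emps.SALARY
smax = np.max(emps)
smin = np.min(emps)

bmax = int(smax / 1000) * 1000 + 1000
bmin = int(smin / 1000) * 1000

# Bin width is 2500; if (bmax - bmin) is not a multiple of 2500, check that the last edge still reaches the maximum salary
bins = np.arange(bmin, bmax + 1, 2500)

# np.histogram returns (counts, edges), so unpack both
hist, bins = np.histogram(emps, bins)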

plt.hist(emps, bins)
plt.grid()

sns.histplot(emps, bins=bins, kde=True)

<AxesSubplot:xlabel='SALARY', ylabel='Count'>

Visualizing Titanic passenger ages

Build a frequency table
Draw a histogram and KDE

titanic = pd.read_csv('../data/titanic.csv')

titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pclass    1309 non-null   int64  
 1   survived  1309 non-null   int64  
 2   name      1309 non-null   object 
 3   sex       1309 non-null   object 
 4   age       1046 non-null   float64
 5   sibsp     1309 non-null   int64  
 6   parch     1309 non-null   int64  
 7   ticket    1309 non-null   object 
 8   fare      1308 non-null   float64
 9   cabin     295 non-null    object 
 10  embarked  1307 non-null   object 
dtypes: float64(2), int64(4), object(5)
memory usage: 112.6+ KB

titanic = titanic.age

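Handling missing values (data such as null that cannot be used in calculations): removal or replacement

When replacing, values such as the median or a quartile can be used, but any substitute can distort the distribution, so the replacement value should be chosen to suit the situation.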

titanic = titanic.dropna()  # drop rows with missing ages
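The two options can be compared on a small example. The sketch below uses a made-up Series rather than the Titanic data: `dropna` shortens the data, while `fillna` with the median keeps the length but piles the filled-in rows onto a single value, which is the kind of distortion mentioned above.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 35.0, 58.0, np.nan, 4.0], name='age')  # made-up values

dropped = ages.dropna()                 # removal: rows with NaN disappear
filled = ages.fillna(ages.median())     # replacement: NaN becomes the median of the observed values

print(len(dropped), len(filled))        # 4 6
print(ages.median())                    # 28.5
print((filled == 28.5).sum())           # 2 rows now share the same imputed value
```

In a histogram of `filled`, the class containing the median would gain every imputed row, so the bar can look taller than the observed data alone would justify.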

amax = np.max(titanic)
amin = np.min(titanic)

bmax = int(amax / 10) * 10 + 10
bmin = int(amin / 10) * 10

bins = np.arange(bmin, bmax + 1, 5)
# np.histogram returns (counts, edges), so unpack both
hist, bins = np.histogram(titanic, bins)

plt.hist(titanic, bins=bins)
plt.grid()

sns.histplot(titanic, bins=bins, kde=True)

<AxesSubplot:xlabel='age', ylabel='Count'>
