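Exploratory Data Analysis: Methods and Applications
1. Frequency and mode are common measures for briefly describing how data are distributed. Write functions that compute the frequency of each element in a sequence and the mode of the sequence, and construct your own data to validate them.
Code: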
# -*- coding: utf-8 -*-
# Frequency and mode

# Count how many times each distinct element occurs
def count(seq):
    counts = {}
    for item in seq:
        counts[item] = counts.get(item, 0) + 1
    return counts

# Convert counts to relative frequencies (fractions that sum to 1)
def transform(counts):
    total = float(sum(counts.values()))
    return {k: v / total for k, v in counts.items()}

# Get the mode: the key with the largest value.
# If several keys tie, the one seen first is returned.
def mode(freq):
    return max(freq, key=freq.get)

l = ['a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e', 'c', 'd', 'e']
counts = count(l)
print("Count for Distinct:", counts)
freq = transform(counts)
print("Frequency:", freq)
print("Mode:", mode(freq))
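The exercise also asks for a way to validate the functions. One possible check, sketched here, compares a hand-written frequency function against `collections.Counter` from the standard library. The names `frequencies` and `modes` are hypothetical and not part of the original answer; `modes` returns every tied element, since the sample list has a three-way tie that a single-valued mode hides.

```python
from collections import Counter

def frequencies(seq):
    """Return {element: relative frequency} for the elements of seq."""
    counts = {}
    for item in seq:
        counts[item] = counts.get(item, 0) + 1
    total = len(seq)
    return {k: v / total for k, v in counts.items()}

def modes(freq):
    """Return every element that attains the highest frequency."""
    top = max(freq.values())
    return [k for k, v in freq.items() if v == top]

sample = ['a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e', 'c', 'd', 'e']
freq = frequencies(sample)
reference = Counter(sample)

# Each frequency should equal the Counter count divided by the sample size.
assert all(abs(freq[k] - reference[k] / len(sample)) < 1e-12 for k in reference)
# The frequencies of all distinct elements should sum to 1.
assert abs(sum(freq.values()) - 1.0) < 1e-12

print(modes(freq))  # three-way tie in this sample: ['c', 'd', 'e']
```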
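2. Percentiles are another common measure for describing the characteristics of a data distribution. Write a function that computes the percentiles of a data series, and compute the percentiles of the four attributes in the iris data set.
Code: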
# -*- coding: utf-8 -*-
import numpy as np

iris = np.loadtxt(r"E:\iris_proc.data", delimiter=",")
# One row per percentile (0, 10, ..., 100). Column 0 holds the percentile,
# columns 1-4 hold the corresponding value of each of the four attributes.
rel = np.zeros((11, 5))
rel[:, 0] = range(0, 101, 10)
for col in range(1, 5):
    rel[:, col] = np.percentile(iris[:, col - 1], rel[:, 0])
print(rel)
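The exercise asks for a percentile function, while the code above relies on `np.percentile`. A minimal pure-Python sketch of one common definition follows: linear interpolation between the two closest ranks, intended to mirror NumPy's default behaviour. The function name `percentile` and the sample values are assumptions made for illustration.

```python
import math

def percentile(values, p):
    """Percentile p (0-100) by linear interpolation between closest ranks."""
    if len(values) == 0:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100.0  # fractional 0-based position
    lo = math.floor(rank)
    hi = math.ceil(rank)
    if lo == hi:
        # The rank falls exactly on an element, so no interpolation is needed.
        return float(ordered[lo])
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

data = [15, 20, 35, 40, 50]  # made-up values
print([percentile(data, p) for p in range(0, 101, 25)])
# [15.0, 20.0, 35.0, 40.0, 50.0]
```

Other percentile definitions exist (nearest rank, for example), so results can differ slightly between tools on small samples.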
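3. Statistics that measure the central tendency of a data series typically include the mean, the median, and the trimmed mean. Write functions that compute each of them, and apply them to the four attributes of the iris data set.
Code: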
# -*- coding: utf-8 -*-
import numpy as np

def mean(x):
    return sum(x) / float(len(x))

def median(y):
    x = np.sort(y)
    n = len(x)
    if n % 2 == 0:
        # Even length: average the two middle values (0-based indices n/2 - 1 and n/2)
        return (x[n // 2 - 1] + x[n // 2]) / 2.0
    else:
        return x[n // 2]

def trimmean(x, p):
    # Cut roughly p/2 percent from each tail, using percentiles as cut-offs.
    # Values equal to a cut-off are kept.
    b = np.percentile(x, p / 2)
    t = np.percentile(x, 100 - p / 2)
    return mean([i for i in x if b <= i <= t])

iris = np.loadtxt(r"E:\iris_proc.data", delimiter=",")
# Rows: mean, median, 20% trimmed mean; columns: the four attributes
rel = np.zeros((3, 4))
for j in range(4):
    rel[0, j] = mean(iris[:, j])
    rel[1, j] = median(iris[:, j])
    rel[2, j] = trimmean(iris[:, j], 20)
print(rel)
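The trimmed mean above cuts at percentile values, so the number of points removed depends on ties at the cut-offs. A different and also common convention removes a fixed count from each end of the sorted data. The sketch below uses that count-based rule only for comparison; the name `trimmed_mean` and the sample values are made up, and `statistics.median` serves as a reference for the median.

```python
import statistics

def trimmed_mean(values, p):
    """Mean after dropping the lowest and highest p/2 percent of the values."""
    if not 0 <= p < 100:
        raise ValueError("p must be in [0, 100)")
    ordered = sorted(values)
    k = int(len(ordered) * p / 200.0)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 95]  # made-up values with one outlier
print(statistics.mean(data))    # 15
print(statistics.median(data))  # 6.5
print(trimmed_mean(data, 20))   # 6.625
```

The single outlier pulls the mean to 15, while the median and the 20% trimmed mean stay near the bulk of the data.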
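4. Statistics that briefly describe the dispersion of a data series typically include the range, the standard deviation, the average absolute deviation (AAD), the median absolute deviation (MAD), and the interquartile range (IQR). Write functions that compute these statistics, and apply them to the four attributes of the iris data set.
Code: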
import numpy as np
from math import sqrt

# Range: largest value minus smallest value
def rang(x):
    return max(x) - min(x)

# Sample variance (divides by n - 1)
def var(x):
    return np.var(x, ddof=1)

def std(x):
    return sqrt(var(x))

# Average absolute deviation from the mean
def aad(x):
    x_mean = np.mean(x)
    return sum(abs(v - x_mean) for v in x) / len(x)

# Median absolute deviation from the median
def mad(x):
    x_median = np.median(x)
    return np.median([abs(v - x_median) for v in x])

# Interquartile range: 75th percentile minus 25th percentile
def iqr(x):
    return np.percentile(x, 75) - np.percentile(x, 25)

iris = np.loadtxt(r"E:\iris_proc.data", delimiter=',')
# Rows: range, standard deviation, AAD, MAD, IQR; columns: the four attributes
rel = np.zeros((5, 4))
for col in range(4):
    rel[0, col] = rang(iris[:, col])
    rel[1, col] = std(iris[:, col])
    rel[2, col] = aad(iris[:, col])
    rel[3, col] = mad(iris[:, col])
    rel[4, col] = iqr(iris[:, col])
print(rel)
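To see how these dispersion measures differ, it can help to run them on a tiny data set that is easy to check by hand. The sketch below uses only the standard library and two made-up lists, one of which replaces the largest value with an outlier. The function names mirror the ones above but are re-implemented here, and the IQR uses `statistics.quantiles` with `method='inclusive'`, which is assumed to correspond to the linear-interpolation percentile rule.

```python
import statistics

def aad(values):
    """Average absolute deviation from the mean."""
    m = statistics.fmean(values)
    return sum(abs(v - m) for v in values) / len(values)

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    return q3 - q1

clean = [1, 2, 3, 4, 5]      # made-up values
dirty = [1, 2, 3, 4, 100]    # same values, largest replaced by an outlier

for name, data in (("clean", clean), ("dirty", dirty)):
    print(name,
          "range:", max(data) - min(data),
          "std:", statistics.stdev(data),
          "AAD:", aad(data),
          "MAD:", mad(data),
          "IQR:", iqr(data))
```

On these two lists the range, standard deviation, and AAD grow sharply once the outlier is present, while the MAD and IQR stay the same, which is why the latter two are often described as robust.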
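5. A stem-and-leaf plot is a simple way to visualize a data distribution. Write a program that outputs a stem-and-leaf plot, and use it to print the plot for the first attribute of the iris data set, sepal length.
Code: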
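from itertools import groupby
import numpy as np

iris = np.loadtxt(r"E:\iris_proc.data", delimiter=',')
# Scale sepal length (one decimal place) to integers, e.g. 5.1 -> 51
data = sorted(int(round(e * 10)) for e in iris[:, 0])
# Group into bins of width 5, so each stem (tens digit) is split into a
# low half (leaves 0-4) and a high half (leaves 5-9).
# k // 2 is the stem; the units digit of each value is its leaf.
for k, g in groupby(data, key=lambda x: x // 5):
    leaves = ''.join(str(v % 10) for v in g)
    print(k // 2, '|', leaves)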
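Because the iris file is not included here, the same grouping idea can be tried on a short made-up list. The sketch below wraps it in a hypothetical helper, `stem_and_leaf`, that returns the lines instead of printing them. It assumes non-negative values with one decimal place, and the `split` flag (also an assumption) switches between one line per stem and two lines per stem.

```python
from itertools import groupby

def stem_and_leaf(values, split=True):
    """Return the lines of a stem-and-leaf display for one-decimal values."""
    # Scale to integers so that the tens part is the stem and the units digit the leaf.
    scaled = sorted(int(round(v * 10)) for v in values)
    width = 5 if split else 10  # bin width: half a stem or a whole stem
    lines = []
    for key, group in groupby(scaled, key=lambda x: x // width):
        stem = key * width // 10
        leaves = ''.join(str(x % 10) for x in group)
        lines.append(f"{stem} | {leaves}")
    return lines

sample = [4.3, 4.4, 4.6, 4.9, 5.0, 5.1, 5.1, 5.8]  # made-up values
for line in stem_and_leaf(sample):
    print(line)
# 4 | 34
# 4 | 69
# 5 | 011
# 5 | 8
```

Note that `groupby` only merges adjacent items, so the data must be sorted before grouping.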