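Creating a DataFrame
A DataFrame handles structured (table-like) data: a two-dimensional dataset with a row index and column labels. It can be built from a dictionary or an array, or from data read from an external source such as a CSV file or a database.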
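A minimal sketch of the three construction routes, using made-up values. The names df_dict, df_array, df_csv and csv_text are only for illustration, and the CSV is read from an in-memory string so the example runs without a file on disk.

```python
import io

import numpy as np
import pandas as pd

# From a dictionary: the keys become the column labels
df_dict = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

# From a 2-D array: the column labels are supplied separately
df_array = pd.DataFrame(np.array([[1, 2], [3, 4], [5, 6]]), columns=["a", "b"])

# From CSV text; pd.read_csv accepts a file path in the same way
csv_text = "make,price\naudi,13950\nbmw,16430\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_dict.shape, df_array.shape, df_csv.shape)
```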
DataFrame operations
❖ Inspecting the data
The following attributes and methods show information about the current data:
- .shape
- .describe()
- .head()
- .tail()
- .columns
- .index
- .info()
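As a rough illustration, the attributes and methods listed above can be tried on a tiny hand-made table. df_demo and its values are invented for this sketch and are not the car dataset used in the rest of these notes.

```python
import pandas as pd

# Hypothetical three-row table, only for trying out the inspection calls
df_demo = pd.DataFrame(
    {"make": ["audi", "bmw", "volvo"], "price": [13950.0, 16430.0, 12940.0]}
)

print(df_demo.shape)       # (number of rows, number of columns)
print(df_demo.columns)     # column labels
print(df_demo.index)       # row index
print(df_demo.head(2))     # first two rows
print(df_demo.tail(1))     # last row
print(df_demo.describe())  # summary statistics of the numeric column
df_demo.info()             # dtypes and non-null counts
```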
import matplotlib.pyplot as plt
import seaborn as sns
df.dtypes
df.describe() for Continuous numerical variables
- the count of that variable
- the mean
- the standard deviation (std)
- the minimum value
- the quartiles (25%, 50% and 75%); the interquartile range (IQR) is the 75% value minus the 25% value
- the maximum value
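A small sketch of how the rows of describe() relate to the quartiles and the IQR. The five prices are made up so the quartiles are easy to check by hand.

```python
import pandas as pd

# Made-up values chosen so the quartiles are obvious
prices = pd.DataFrame({"price": [1.0, 2.0, 3.0, 4.0, 5.0]})

summary = prices.describe()
q1 = summary.loc["25%", "price"]   # first quartile
q3 = summary.loc["75%", "price"]   # third quartile
iqr = q3 - q1                      # interquartile range

print(summary)
print("IQR:", iqr)
```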
df.describe(include=['object']) for Categorical variables
df.corr() for Pearson Correlation
The Pearson Correlation measures the linear dependence between two variables X and Y.
The resulting coefficient is a value between -1 and 1 inclusive, where:
•1: total positive linear correlation,
•0: no linear correlation; the two variables may still be related in a non-linear way,
•-1: total negative linear correlation.
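The three reference values can be reproduced with a toy table. The sketch below uses invented columns: "up" rises linearly with x, "down" falls linearly, and "square" depends on x completely but not linearly, so its Pearson coefficient with x is zero.

```python
import pandas as pd

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
demo = pd.DataFrame(
    {
        "x": x,
        "up": [2 * v + 1 for v in x],   # perfect positive linear relation
        "down": [-v for v in x],        # perfect negative linear relation
        "square": [v ** 2 for v in x],  # fully dependent on x, but not linearly
    }
)

corr = demo.corr()
print(corr["x"])
```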
To compute correlations on the numeric columns only, select them first:
df = df.select_dtypes(include='number')
Continuous numerical variables
sns.regplot(x="highway-mpg", y="price", data=df)
df[['highway-mpg', 'price']].corr()
sns.boxplot(x="body-style", y="price", data=df)
ANOVA: Analysis of Variance
The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:
F-test score: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from that assumption, and reports the result as the F-test score. A larger score means a larger difference between the means.
P-value: the p-value tells us how statistically significant the calculated score is.
If our price variable is strongly correlated with the variable we are analyzing, expect ANOVA to return a sizeable F-test score and a small p-value.
P-value: What is this P-value? The p-value is the probability of observing a correlation at least as strong as the one in our sample if the two variables were in fact uncorrelated. Normally, we choose a significance level of 0.05: when the p-value is below 0.05, we reject the hypothesis of no correlation and call the correlation statistically significant.
By convention, when the
•p-value is < 0.001, there is strong evidence that the correlation is significant,
•p-value is < 0.05, there is moderate evidence that the correlation is significant,
•p-value is < 0.1, there is weak evidence that the correlation is significant, and
•p-value is > 0.1, there is no evidence that the correlation is significant.
from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
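A self-contained sketch of the same call on synthetic data, together with a helper that maps a p-value to the conventional wording above. The helper evidence_label and the generated x and y are assumptions made for illustration; they are not part of scipy or of the car dataset.

```python
import numpy as np
from scipy import stats


def evidence_label(p_value):
    """Map a p-value to the conventional wording used in these notes."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    return "none"


# Synthetic data: a clear linear trend plus a little noise
rng = np.random.default_rng(0)
x_demo = np.arange(30, dtype=float)
y_demo = 3.0 * x_demo + rng.normal(scale=2.0, size=x_demo.size)

r, p = stats.pearsonr(x_demo, y_demo)
print("r =", r, "p =", p, "evidence:", evidence_label(p))
```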
Three classes: rwd, fwd, 4wd
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price'])
ANOVA results: F= 67.95406500780399 , P = 3.3945443577151245e-23
Two classes: rwd, fwd
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'])
ANOVA results: F= 130.5533160959111 , P = 2.2355306355677845e-23
Two classes: 4wd, rwd
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('rwd')['price'])
ANOVA results: F= 8.580681368924756 , P = 0.004411492211225333
Two classes: 4wd, fwd
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('fwd')['price'])
ANOVA results: F= 0.665465750252303 , P = 0.41620116697845666
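The notes do not show how grouped_test2 is built. A minimal sketch, assuming it is the price column grouped by drive-wheels, looks like this; the table cars and its prices are invented so that rwd is clearly more expensive while fwd and 4wd are nearly the same.

```python
import pandas as pd
from scipy import stats

# Invented prices: rwd is far from the other two groups, fwd and 4wd are close
cars = pd.DataFrame(
    {
        "drive-wheels": ["fwd"] * 4 + ["rwd"] * 4 + ["4wd"] * 4,
        "price": [9.0, 10.0, 11.0, 10.0,
                  19.0, 20.0, 21.0, 20.0,
                  10.0, 11.0, 9.0, 11.0],
    }
)
grouped = cars.groupby("drive-wheels")["price"]

# All three groups: large F, tiny p-value
f_all, p_all = stats.f_oneway(
    grouped.get_group("fwd"), grouped.get_group("rwd"), grouped.get_group("4wd")
)

# Only the two similar groups: small F, large p-value
f_pair, p_pair = stats.f_oneway(grouped.get_group("fwd"), grouped.get_group("4wd"))

print("all three: F =", f_all, "P =", p_all)
print("fwd vs 4wd: F =", f_pair, "P =", p_pair)
```

The pattern mirrors the results above: a comparison that includes a clearly different group gives a large F and a small p-value, while two similar groups give a small F and a p-value well above 0.05.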
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = df[['highway-mpg']]
Y = df['price']
lm.fit(X,Y)
lm.intercept_ >> 38423.305858157386
lm.coef_ >> array([-821.73337832])
price = 38423.31 - 821.73 x highway-mpg
lm1 = LinearRegression()
lm1.fit(df[['engine-size']], df[['price']])
lm1.intercept_ >> array([-7963.33890628])
lm1.coef_ >> array([[166.86001569]])
price = -7963.34 + 166.86 x engine-size
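The two fits above return differently shaped results because the first passes a Series as the target and the second a one-column DataFrame. A small sketch on an invented exact line (intercept 10, slope -2) shows the difference; toy, lm_1d and lm_2d are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
toy["y"] = 10.0 - 2.0 * toy["x"]  # exact line: intercept 10, slope -2

# 1-D target (a Series): intercept_ is a scalar, coef_ has shape (1,)
lm_1d = LinearRegression().fit(toy[["x"]], toy["y"])

# 2-D target (a one-column DataFrame): intercept_ has shape (1,), coef_ has shape (1, 1)
lm_2d = LinearRegression().fit(toy[["x"]], toy[["y"]])

print(lm_1d.intercept_, lm_1d.coef_)
print(lm_2d.intercept_, lm_2d.coef_)
```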
Multiple Linear Regression
lm = LinearRegression()
Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
lm.fit(Z, df['price'])
lm.intercept_ >> -15806.624626329198
lm.coef_ >> array([53.49574423, 4.70770099, 81.53026382, 36.05748882])
Price = -15806.624626329198 + 53.49574423 x horsepower + 4.70770099 x curb-weight + 81.53026382 x engine-size + 36.05748882 x highway-mpg
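To see how intercept_ and coef_ combine into the equation, the sketch below fits a model on synthetic features with known coefficients and rebuilds the predictions by hand. Z_demo, true_coef and y_demo are made up; only the LinearRegression calls come from the notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known intercept (7.0) and known coefficients
rng = np.random.default_rng(1)
Z_demo = rng.normal(size=(50, 4))
true_coef = np.array([2.0, -1.0, 0.5, 3.0])
y_demo = 7.0 + Z_demo @ true_coef

lm_demo = LinearRegression().fit(Z_demo, y_demo)

# prediction = intercept + sum(coefficient_i * feature_i)
manual = lm_demo.intercept_ + Z_demo @ lm_demo.coef_

print(lm_demo.intercept_, lm_demo.coef_)
```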
Model Evaluation using Visualization
import seaborn as sns
sns.regplot(x="highway-mpg", y="price", data=df)
We can see from this plot that price is negatively correlated to highway-mpg, since the regression slope is negative. One thing to keep in mind when looking at a regression plot is to pay attention to how scattered the data points are around the regression line. This will give you a good indication of the variance of the data, and whether a linear model would be the best fit or not. If the data is too far off from the line, this linear model might not be the best model for this data.
Use the method .corr() to verify the figure above:
df[["highway-mpg","price"]].corr()
What is a residual?
The difference between the observed value (y) and the predicted value (Yhat) is called the residual (e). When we look at a regression plot, the residual is the distance from the data point to the fitted regression line.
So what is a residual plot?
A residual plot is a graph that shows the residuals on the vertical y-axis and the independent variable on the horizontal x-axis.
What do we pay attention to when looking at a residual plot?
We look at the spread of the residuals:
•If the points in a residual plot are randomly spread out around the x-axis, then a linear model is appropriate for the data. Why is that? Randomly spread out residuals means that the variance is constant, and thus the linear model is a good fit for this data.
sns.residplot(x=df['highway-mpg'], y=df['price'])
We can see from this residual plot that the residuals are not randomly spread around the x-axis, which leads us to believe that maybe a non-linear model is more appropriate for this data.
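A numeric sketch of the same idea without plotting: a straight line is fitted to data that is actually quadratic, and the residuals come out in a clear U shape instead of being randomly scattered. The data is synthetic, and np.polyfit with degree 1 stands in for the linear model.

```python
import numpy as np

x_demo = np.arange(-3.0, 4.0)  # -3, -2, ..., 3
y_demo = x_demo ** 2           # truly quadratic relationship

# Fit a straight line, then compute residual = observed - predicted
slope, intercept = np.polyfit(x_demo, y_demo, 1)
y_hat = slope * x_demo + intercept
residuals = y_demo - y_hat

print(residuals)  # positive at both ends, negative in the middle
```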
Multiple Linear Regression (distribution plot)
Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
Yhat = lm.predict(Z)
# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curves
ax1 = sns.kdeplot(df['price'], color="r", label="Actual Value")
sns.kdeplot(Yhat, color="b", label="Fitted Values", ax=ax1)
We can see that the fitted values are reasonably close to the actual values, since the two distributions overlap a bit. However, there is definitely some room for improvement.
import numpy as np
def PlotPolly(model, independent_variable, dependent_variable, Name):
    # Evaluate the fitted polynomial on an evenly spaced grid
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)
    # Data points as dots, fitted curve as a line
    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ Length')
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    fig = plt.gcf()
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')
    plt.show()
    plt.close()
x = df['highway-mpg']
y = df['price']
Let's fit the polynomial using the function polyfit, then use the function poly1d to display the polynomial function.
f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p) >>
-1.557 x^3 + 204.8 x^2 - 8965 x + 1.379e+05
PlotPolly(p,x,y, 'highway-mpg')
np.polyfit(x, y, 3) >> array([-1.55663829e+00, 2.04754306e+02, -8.96543312e+03, 1.37923594e+05])
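A quick way to check how polyfit orders its output is to fit data generated from a known cubic. In this sketch the data comes from 2x^3 - x + 5 (an invented example), so the returned coefficients should be close to [2, 0, -1, 5], highest power first.

```python
import numpy as np

x_demo = np.arange(-3.0, 4.0)
y_demo = 2.0 * x_demo**3 - x_demo + 5.0  # known cubic, no noise

coeffs = np.polyfit(x_demo, y_demo, 3)   # coefficients, highest power first
p_demo = np.poly1d(coeffs)               # callable polynomial object

print(p_demo)
print(p_demo(2.0))  # evaluate the fitted polynomial at x = 2
```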
Create an 11th-order polynomial model with the variables x and y
f1 = np.polyfit(x, y, 11)
p1 = np.poly1d(f1)
PlotPolly(p1,x,y, 'highway-mpg')
print(p1) >>
-1.243e-08 x^11 + 4.722e-06 x^10 - 0.0008028 x^9 + 0.08056 x^8 - 5.297 x^7 + 239.5 x^6
- 7588 x^5 + 1.684e+05 x^4 - 2.565e+06 x^3 + 2.551e+07 x^2 - 1.491e+08 x + 3.879e+08
Pipelines
Data Pipelines simplify the steps of processing the data. We use the module Pipeline to create a pipeline.
We also use StandardScaler as a step in our pipeline.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
Input=[('scale',StandardScaler()),('polynomial', PolynomialFeatures(include_bias=False)),('model',LinearRegression())]
pipe=Pipeline(Input) #we input the list as an argument to the pipeline constructor
pipe.fit(Z,y) #We can normalize the data, perform a transform and fit the model simultaneously
ypipe=pipe.predict(Z) #Similarly, we can normalize the data, perform a transform and produce a prediction simultaneously
ypipe[0:4]
Example: Create a pipeline that standardizes the data, then performs prediction with a linear regression model using the features Z and target y.
Input=[('scale',StandardScaler()),('model',LinearRegression())]
pipe=Pipeline(Input)
pipe.fit(Z,y)
ypipe=pipe.predict(Z)
ypipe[0:10]
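A runnable sketch of the scale, polynomial, model pipeline on synthetic data. Z_demo and y_demo are invented, with a target that is exactly quadratic in the features, so a degree-2 pipeline should reproduce it almost perfectly; degree=2 is written out explicitly here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic features and a target that is quadratic in them
rng = np.random.default_rng(2)
Z_demo = rng.normal(loc=50.0, scale=10.0, size=(40, 2))
y_demo = 1.0 + 2.0 * Z_demo[:, 0] + 0.03 * Z_demo[:, 1] ** 2

steps = [
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
]
pipe_demo = Pipeline(steps)

# fit() runs scaling, the polynomial expansion, and the regression fit in order
pipe_demo.fit(Z_demo, y_demo)

# predict() applies the same transforms before calling the model
y_pipe = pipe_demo.predict(Z_demo)
print(y_pipe[0:4])
```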
Measures for In-Sample Evaluation
When evaluating our models, not only do we want to visualise the results,
but we also want a quantitative measure to determine how accurate the model is.
Two very important measures that are often used in Statistics to determine the accuracy of a model are:
R-squared (0 is bad, 1 is good)
R squared, also known as the coefficient of determination, is a measure to indicate how close the data is to the fitted regression line.
The value of R-squared is the proportion of the variation of the response variable (y) that is explained by a linear model.
Mean Squared Error (MSE) << a smaller MSE value means a better fit
The Mean Squared Error measures the average of the squares of errors, that is, the difference between actual value (y) and the estimated value (ŷ).
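Both measures can be computed by hand and compared with the sklearn functions. The sketch below uses four made-up observations: MSE is the mean of the squared errors, and R-squared is 1 minus the residual sum of squares divided by the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# MSE: average of the squared differences
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```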
Model 1: Simple Linear Regression
# X = df[['highway-mpg']], Y = df['price']
lm.fit(X, Y)
lm.score(X, Y) >> 0.4965911884339175 [R^2]
We can say that ~ 49.659% of the variation of the price is explained by this simple linear model.
Yhat=lm.predict(X)
Yhat[0:4]
#lets import the function mean_squared_error from the module metrics
from sklearn.metrics import mean_squared_error
#mean_squared_error(Y_true, Y_predict)
mean_squared_error(df['price'], Yhat) >> 31635042.944639895
Model 2: Multiple Linear Regression
# fit the model
# Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
lm.fit(Z, df['price'])
# Find the R^2
lm.score(Z, df['price']) >> 0.8093562806577457
We can say that ~ 80.936 % of the variation of price is explained by this multiple linear regression.
Y_predict_multifit = lm.predict(Z)
mean_squared_error(df['price'], Y_predict_multifit) >> 11980366.87072649
Model 3: Polynomial Fit
from sklearn.metrics import r2_score
r_squared = r2_score(y, p(x))
r_squared >> 0.6741946663906513
mean_squared_error(df['price'], p(x)) >> 20474146.42636125