Python Data Analysis and Modeling
by 吳俊逸, 2018-05-23 14:16:46

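## Creating a DataFrame

A DataFrame handles structured (table-like) data: a two-dimensional dataset with a row index and column labels. It can be built from a dictionary or an array, or loaded from an external source such as a CSV file or a database.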
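
As a minimal sketch of the three construction routes just mentioned, the snippet below builds small DataFrames from a dictionary, from an array, and from CSV text. The column names and values are made up for illustration, and the CSV is read from an in-memory string rather than a real file.

```python
import io

import numpy as np
import pandas as pd

# From a dictionary: the keys become the column labels.
df_dict = pd.DataFrame({"make": ["audi", "bmw", "volvo"],
                        "price": [13950, 16430, 12940]})

# From a 2-D array: the column labels are supplied separately.
df_array = pd.DataFrame(np.array([[21, 27], [19, 26]]),
                        columns=["city-mpg", "highway-mpg"])

# From CSV text (an in-memory stand-in for a file path given to read_csv).
csv_text = "make,price\naudi,13950\nbmw,16430\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_dict.shape, df_array.shape, df_csv.shape)
```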

DataFrame operations
❖ Inspecting and describing the data

• .shape
• .describe()
• .tail()
• .columns
• .index
• .info()
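
The attributes and methods listed above can be tried on a toy table. This is only a sketch: `df_demo` and its values are hypothetical and stand in for the car dataset used in the rest of these notes.

```python
import pandas as pd

# Hypothetical four-row table used only to show what each call returns.
df_demo = pd.DataFrame({
    "make": ["audi", "bmw", "volvo", "mazda"],
    "highway-mpg": [30, 29, 28, 31],
    "price": [13950.0, 16430.0, 12940.0, 6795.0],
})

print(df_demo.shape)             # (rows, columns)
print(df_demo.columns.tolist())  # column labels
print(df_demo.index)             # row index
print(df_demo.tail(2))           # last two rows
print(df_demo.describe())        # summary statistics of the numeric columns
df_demo.info()                   # dtypes and non-null counts, printed directly
```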

import matplotlib.pyplot as plt
import seaborn as sns

df.dtypes

df.describe() for Continuous numerical variables
• the count of that variable
• the mean
• the standard deviation (std)
• the minimum value
• the quartiles (25%, 50% and 75%); the interquartile range (IQR) is the 75% value minus the 25% value
• the maximum value

df.describe(include=['object']) for Categorical variables

df.corr() for Pearson Correlation

The Pearson Correlation measures the linear dependence between two variables X and Y.
The resulting coefficient is a value between -1 and 1 inclusive, where:
•1: total positive linear correlation,
•0: no linear correlation (the two variables may still be related in a non-linear way),
•-1: total negative linear correlation.

If you want to use only the numeric columns:
df = df.select_dtypes(include='number')
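
To make the coefficient concrete, here is a small sketch that computes Pearson's r from its definition (the covariance of X and Y divided by the product of their standard deviations) and compares it with what `.corr()` reports. The six rows are made-up values, not the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up values; "make" is a text column that must be excluded before .corr().
df_demo = pd.DataFrame({
    "highway-mpg": [27, 26, 30, 22, 25, 34],
    "price": [13495, 16500, 13950, 17450, 15250, 9980],
    "make": ["a", "b", "c", "d", "e", "f"],
})

numeric = df_demo.select_dtypes(include="number")
x = numeric["highway-mpg"].to_numpy(dtype=float)
y = numeric["price"].to_numpy(dtype=float)

# r = sum((x - mean_x)(y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
dx, dy = x - x.mean(), y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

r_pandas = numeric.corr().loc["highway-mpg", "price"]
print(r_manual, r_pandas)
```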

## Continuous numerical variables

sns.regplot(x="highway-mpg", y="price", data=df)
df[['highway-mpg', 'price']].corr()

## Categorical variables

sns.boxplot(x="body-style", y="price", data=df)

### ANOVA: Analysis of Variance

The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:

F-test score: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from that assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.

P-value: the P-value tells how statistically significant the calculated score is.

If our price variable is strongly correlated with the variable we are analyzing, expect ANOVA to return a sizeable F-test score and a small p-value.

P-value: What is this P-value? The P-value is the probability of obtaining a result at least as extreme as the one observed if there were in fact no correlation between the two variables (the null hypothesis). Normally, we choose a significance level of 0.05: we call the correlation statistically significant when the p-value is below 0.05, which accepts a 5% chance of declaring a correlation significant when none exists.

By convention, when the p-value is
•< 0.001, we say there is strong evidence that the correlation is significant,
•< 0.05, there is moderate evidence that the correlation is significant,
•< 0.1, there is weak evidence that the correlation is significant, and
•> 0.1, there is no evidence that the correlation is significant.
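
The convention above can be written as a small lookup. `evidence_label` is a hypothetical helper, not part of scipy, and treating a p-value of exactly 0.1 as "none" is an assumption, since the list does not say which side that boundary falls on.

```python
def evidence_label(p_value: float) -> str:
    """Map a p-value to the evidence wording used in the convention above."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    return "none"  # assumption: exactly 0.1 is treated like > 0.1


for p in (3.39e-23, 0.0044, 0.07, 0.416):
    print(p, evidence_label(p))
```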

from scipy import stats

pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

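Three classes: rwd, fwd, 4wd (grouped_test2 is not defined in these notes; it is presumably the price data grouped by drive-wheel type)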
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price'])

ANOVA results: F= 67.95406500780399 , P = 3.3945443577151245e-23

Two classes: rwd, fwd
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'])

ANOVA results: F= 130.5533160959111 , P = 2.2355306355677845e-23

Two classes: 4wd, rwd
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('rwd')['price'])

ANOVA results: F= 8.580681368924756 , P = 0.004411492211225333

Two classes: 4wd, fwd
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('fwd')['price'])

ANOVA results: F= 0.665465750252303 , P = 0.41620116697845666
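
The pairwise comparisons above can also be run in a loop. The sketch below uses synthetic prices, so its F and P values will not match the results quoted above. The column name `drive-wheels` and the group means are assumptions made for the example.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: "rwd" cars are given a clearly higher mean price on purpose.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "drive-wheels": ["fwd"] * 30 + ["rwd"] * 30 + ["4wd"] * 30,
    "price": np.concatenate([
        rng.normal(9000, 1500, 30),
        rng.normal(20000, 1500, 30),
        rng.normal(9500, 1500, 30),
    ]),
})

grouped = demo.groupby("drive-wheels")

# One-way ANOVA for every pair of groups.
results = {}
for a, b in combinations(["fwd", "rwd", "4wd"], 2):
    f_val, p_val = stats.f_oneway(grouped.get_group(a)["price"],
                                  grouped.get_group(b)["price"])
    results[(a, b)] = (f_val, p_val)
    print(f"{a} vs {b}: F = {f_val:.3f}, P = {p_val:.3g}")
```

Pairs whose means were set far apart should produce a large F and a very small P, while the fwd and 4wd pair, generated with similar means, typically does not.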

#### Linear Regression

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = df[['highway-mpg']]
Y = df['price']
lm.fit(X,Y)
lm.intercept_ >> 38423.305858157386
lm.coef_ >> array([-821.73337832])
price = 38423.31 - 821.73 x highway-mpg

lm1 = LinearRegression()
lm1.fit(df[['engine-size']], df[['price']])
lm1.intercept_ >> array([-7963.33890628])
lm1.coef_ >> array([[166.86001569]])
price = -7963.34 + 166.86 x engine-size
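
A short sketch of how `intercept_` and `coef_` turn into the line equations written above, using synthetic data in place of the car dataset. It also shows why the second model prints its values inside arrays: fitting with a two-dimensional target, as `df[['price']]` is, makes scikit-learn return `coef_` as a 2-D array and `intercept_` as a 1-D array.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for highway-mpg and price.
rng = np.random.default_rng(1)
mpg = rng.uniform(15, 55, size=50)
price = 38000 - 800 * mpg + rng.normal(0, 2000, size=50)
X_demo = mpg.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

# 1-D target: intercept_ is a scalar and coef_ has shape (1,).
lm_demo = LinearRegression().fit(X_demo, price)
manual = lm_demo.intercept_ + lm_demo.coef_[0] * mpg  # price = a + b * mpg

# 2-D target: coef_ has shape (1, 1) and intercept_ has shape (1,).
lm_2d = LinearRegression().fit(X_demo, price.reshape(-1, 1))

print(lm_demo.intercept_, lm_demo.coef_)
print(lm_2d.intercept_, lm_2d.coef_)
```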

#### Multiple Linear Regression

lm = LinearRegression()

Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
lm.fit(Z, df['price'])

lm.intercept_ >> -15806.624626329198
lm.coef_ >> array([53.49574423, 4.70770099, 81.53026382, 36.05748882])

Price = -15806.62 + 53.50 x horsepower + 4.71 x curb-weight + 81.53 x engine-size + 36.06 x highway-mpg

### Model Evaluation using Visualization

import seaborn as sns

### Regression Plot

sns.regplot(x="highway-mpg", y="price", data=df)

We can see from this plot that price is negatively correlated with highway-mpg, since the regression slope is negative. One thing to keep in mind when looking at a regression plot is how scattered the data points are around the regression line. This gives a good indication of the variance of the data, and of whether a linear model would be the best fit or not. If the data points lie too far from the line, a linear model might not be the best model for this data.

Use the method .corr() to verify the figure above:

df[["highway-mpg","price"]].corr()

### Residual Plot

What is a residual?

The difference between the observed value (y) and the predicted value (Yhat) is called the residual (e). When we look at a regression plot, the residual is the distance from the data point to the fitted regression line.

So what is a residual plot?

A residual plot is a graph that shows the residuals on the vertical y-axis and the independent variable on the horizontal x-axis.

What do we pay attention to when looking at a residual plot?

We look at the spread of the residuals:
•If the points in a residual plot are randomly spread out around the x-axis, then a linear model is appropriate for the data. Why is that? Randomly spread out residuals means that the variance is constant, and thus the linear model is a good fit for this data.
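
The residuals described above can be computed directly, without a plot. This is a sketch on synthetic, deliberately curved data, and the exponential shape is an assumption chosen to mimic a non-linear relationship. For a least-squares line with an intercept, the residuals always average to zero and are uncorrelated with x, so the useful signal is any remaining pattern, checked here as a correlation with the squared distance from the mean of x.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic curved relationship between x and y.
rng = np.random.default_rng(2)
x = rng.uniform(15, 55, size=80)
y = 40000 * np.exp(-0.04 * x) + rng.normal(0, 300, size=80)

X_demo = x.reshape(-1, 1)
lm_demo = LinearRegression().fit(X_demo, y)

# Residual e = observed y minus predicted y.
residuals = y - lm_demo.predict(X_demo)

# A clearly non-zero value here means the residuals are not randomly spread.
curvature = np.corrcoef((x - x.mean()) ** 2, residuals)[0, 1]
print(residuals.mean(), curvature)
```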

sns.residplot(x=df['highway-mpg'], y=df['price'])

We can see from this residual plot that the residuals are not randomly spread around the x-axis, which leads us to believe that maybe a non-linear model is more appropriate for this data.

### Multiple Linear Regression (distribution plot)

Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]

Y_hat = lm.predict(Z)

ax1 = sns.distplot(df['price'], hist=False, color="r", label="Actual Value")

sns.distplot(Y_hat, hist=False, color="b", label="Fitted Values" , ax=ax1)

We can see that the fitted values are reasonably close to the actual values, since the two distributions overlap a bit. However, there is definitely some room for improvement.

## Polynomial Regression

import numpy as np

def PlotPolly(model, independent_variable, dependent_variable, Name):
    # Evaluate the fitted polynomial on an evenly spaced grid of x values
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)

    # Data points as dots, fitted curve as a line
    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ Length')
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    fig = plt.gcf()
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')

    plt.show()
    plt.close()

x = df['highway-mpg']
y = df['price']

Let's fit the polynomial using the function polyfit, then use the function poly1d to display the polynomial function.

f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p) >>
`-1.557 x^3 + 204.8 x^2 - 8965 x + 1.379e+05`

PlotPolly(p,x,y, 'highway-mpg')

np.polyfit(x, y, 3) >> array([-1.55663829e+00, 2.04754306e+02, -8.96543312e+03, 1.37923594e+05])

Create an 11th-order polynomial model with the variables x and y.

f1 = np.polyfit(x, y, 11)
p1 = np.poly1d(f1)
PlotPolly(p1,x,y, 'highway-mpg')
print(p1) >>
`-1.243e-08 x^11 + 4.722e-06 x^10 - 0.0008028 x^9 + 0.08056 x^8 - 5.297 x^7 + 239.5 x^6 - 7588 x^5 + 1.684e+05 x^4 - 2.565e+06 x^3 + 2.551e+07 x^2 - 1.491e+08 x + 3.879e+08`
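
A rough sketch, on synthetic data, of how the in-sample fit changes with polynomial order. It compares a straight line with a cubic. The helper `in_sample_r2` and the data-generating curve are assumptions made for the example. Because a higher-order least-squares polynomial can always reproduce a lower-order one, its in-sample R^2 never gets worse, so a better in-sample score alone does not show that a high-order model (such as order 11) is the better model.

```python
import numpy as np

# Synthetic curved data standing in for price against highway-mpg.
rng = np.random.default_rng(3)
x_demo = rng.uniform(15, 55, size=100)
y_demo = 40000 * np.exp(-0.04 * x_demo) + rng.normal(0, 500, size=100)


def in_sample_r2(y_true, y_fit):
    """R^2 = 1 - SS_res / SS_tot, evaluated on the data used for fitting."""
    ss_res = np.sum((y_true - y_fit) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


p_line = np.poly1d(np.polyfit(x_demo, y_demo, 1))
p_cubic = np.poly1d(np.polyfit(x_demo, y_demo, 3))

r2_line = in_sample_r2(y_demo, p_line(x_demo))
r2_cubic = in_sample_r2(y_demo, p_cubic(x_demo))
print(r2_line, r2_cubic)
```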

## Pipelines

Data pipelines simplify the steps of processing the data. We use the module Pipeline to create a pipeline. We also use StandardScaler as a step in our pipeline.

```
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

Input = [('scale', StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model', LinearRegression())]
pipe = Pipeline(Input)   # we input the list as an argument to the pipeline constructor
pipe.fit(Z, y)           # normalize the data, perform a transform and fit the model in one call
ypipe = pipe.predict(Z)  # similarly, normalize the data, perform a transform and produce a prediction in one call
ypipe[0:4]
```
Example: create a pipeline that standardizes the data, then performs prediction using a linear regression model with the features Z and targets y.

```
Input = [('scale', StandardScaler()), ('model', LinearRegression())]
pipe = Pipeline(Input)
pipe.fit(Z, y)
ypipe = pipe.predict(Z)
ypipe[0:10]
```
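
To show what a pipeline does behind the scenes, the sketch below fits one on synthetic data and then repeats the same three steps by hand. `Z_demo` and `y_demo` are made-up stand-ins for the four features and the price, and `degree=2` is written out explicitly (it is the default of `PolynomialFeatures`).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic features on very different scales, plus a noisy target.
rng = np.random.default_rng(4)
Z_demo = rng.normal(size=(60, 4)) * [50, 500, 40, 7] + [100, 2500, 130, 30]
y_demo = 50 * Z_demo[:, 0] + 5 * Z_demo[:, 1] + rng.normal(0, 1000, 60)

# Pipeline: scale, expand to polynomial features, then fit a linear model.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])
pipe.fit(Z_demo, y_demo)
y_pipe = pipe.predict(Z_demo)

# The same steps done one at a time.
scaled = StandardScaler().fit_transform(Z_demo)
expanded = PolynomialFeatures(degree=2, include_bias=False).fit_transform(scaled)
y_manual = LinearRegression().fit(expanded, y_demo).predict(expanded)

print(y_pipe[0:4])
print(y_manual[0:4])
```

The practical benefit is that `pipe.predict` reapplies the fitted scaler and feature expansion to new data automatically, so the preprocessing cannot drift out of step with the model.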

### Measures for In-Sample Evaluation

When evaluating our models, not only do we want to visualise the results, but we also want a quantitative measure to determine how accurate the model is.

Two very important measures that are often used in statistics to determine the accuracy of a model are:

•R-squared (0 is bad, 1 is good): R-squared, also known as the coefficient of determination, is a measure to indicate how close the data is to the fitted regression line. The value of R-squared is the percentage of variation of the response variable (y) that is explained by a linear model.
•Mean Squared Error (MSE; a smaller MSE value indicates a better fit): the Mean Squared Error measures the average of the squares of the errors, that is, of the differences between the actual value (y) and the estimated value (ŷ).

Model 1: Simple Linear Regression

```
# X = df[['highway-mpg']], Y = df['price']
lm.fit(X, Y)
lm.score(X, Y)  # >> 0.4965911884339175 (R^2)
```

We can say that ~49.659% of the variation of the price is explained by this simple linear model.

```
Yhat = lm.predict(X)
Yhat[0:4]

# let's import the function mean_squared_error from the module metrics
from sklearn.metrics import mean_squared_error

# mean_squared_error(Y_true, Y_predict)
mean_squared_error(df['price'], Yhat)  # >> 31635042.944639895
```

Model 2: Multiple Linear Regression

```
# fit the model
# Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
lm.fit(Z, df['price'])
# Find the R^2
lm.score(Z, df['price'])  # >> 0.8093562806577457
```

We can say that ~80.936% of the variation of price is explained by this multiple linear regression.

```
Y_predict_multifit = lm.predict(Z)
mean_squared_error(df['price'], Y_predict_multifit)  # >> 11980366.87072649
```

Model 3: Polynomial Fit

```
from sklearn.metrics import r2_score
r_squared = r2_score(y, p(x))
r_squared  # >> 0.6741946663906513
mean_squared_error(df['price'], p(x))  # >> 20474146.42636125
```
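
As a check on the two definitions, this sketch computes MSE and R^2 by hand and compares them with scikit-learn's functions. The five actual and predicted values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted prices.
y_true = np.array([13495.0, 16500.0, 13950.0, 17450.0, 15250.0])
y_pred = np.array([14000.0, 16000.0, 14500.0, 17000.0, 15000.0])

# MSE: the average of the squared differences between actual and predicted.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```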