Chapter 1: Introduction

This study examines how educational and social experiences affect the different paths of a student's life, focusing in this case on high school performance. A student's performance in high school foreshadows many of their future life choices, such as whether they will attend college and what kinds of jobs and salaries they will obtain, and all of these choices are affected by the various factors that are part of the student's life. Walpole (2003) found that students of low socioeconomic status fall behind their high-socioeconomic-status peers in extracurricular activities, GPA, and study performance; as a result, these students end up with lower income, educational attainment, and support in life. In a similar study, Hahs-Vaughn (2004) examined the impact of parents' education on students: first-generation students, whose parents did not attend college, faced hurdles in GPA, entrance exams, and aspirations for higher education. Studying the factors that can affect a student's performance in high school can therefore guide both parents and students in shaping the future. Good high school performance can help a student get into a better school with a scholarship, open doors to better future opportunities, build a better social life by earning the respect of teachers and peers, and boost the confidence the student needs to face life's hurdles.

Chapter 2: Statistical Analysis

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a useful technique for reducing a large number of variables (i.e., survey items) to a smaller set of factors. The current research uses a "wide" dataset with many variables to explore high school students' performance and predict their future outcomes, so PCA was conducted first to identify the underlying structure in the data.
We selected 16 variables from the original dataset for the component analysis, as sketched below.
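A minimal sketch of the selection and listwise deletion follows; the file name and the use of readr/dplyr are assumptions, but str() reproduces the structure shown next (including the na.action attribute left by na.omit()).

```r
library(readr)
library(dplyr)

raw <- read_csv("hsls_student.csv")   # hypothetical file name

items <- raw %>%
  select(Pgrade, Pexam, Pcollege, Pdropout,
         Ttreat, Tinterest, Teasy, Tthink, Tgiveup,
         PAcourse, PAexam, PAappli, PAcareer,
         PAjob, PAevent, PAtrouble) %>%   # the 16 survey items used in the PCA
  na.omit()                               # listwise deletion of incomplete responses

str(items)
```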

## Classes 'tbl_df', 'tbl' and 'data.frame':    5010 obs. of  16 variables:
##  $ Pgrade   : num  4 3 5 5 4 4 4 3 3 4 ...
##  $ Pexam    : num  4 2 4 5 5 4 5 2 2 5 ...
##  $ Pcollege : num  4 3 5 5 4 4 5 2 3 5 ...
##  $ Pdropout : num  4 5 5 5 4 5 5 4 4 5 ...
##  $ Ttreat   : num  2 3 3 2 3 2 2 2 2 3 ...
##  $ Tinterest: num  3 2 4 3 4 3 2 3 2 3 ...
##  $ Teasy    : num  3 2 4 1 4 2 3 3 2 3 ...
##  $ Tthink   : num  3 3 3 4 4 3 4 3 3 4 ...
##  $ Tgiveup  : num  2 3 4 3 4 3 4 3 3 3 ...
##  $ PAcourse : num  2 4 4 4 3 3 3 4 4 4 ...
##  $ PAexam   : num  1 4 4 4 4 4 4 2 3 4 ...
##  $ PAappli  : num  3 4 4 4 4 3 4 4 4 4 ...
##  $ PAcareer : num  3 4 4 4 4 3 4 4 4 4 ...
##  $ PAjob    : num  3 4 4 4 4 3 2 4 4 1 ...
##  $ PAevent  : num  4 3 4 4 2 4 4 4 4 3 ...
##  $ PAtrouble: num  2 4 4 4 4 3 4 4 4 2 ...
##  - attr(*, "na.action")= 'omit' Named int  4 5 6 7 8 9 10 11 12 14 ...
##   ..- attr(*, "names")= chr  "4" "5" "6" "7" ...

The original correlation plot demonstrates that most items have some correlation with one another.

Correlation network plots were drawn to illustrate the strength and sign of the correlation between each pair of variables.

The relatively high correlations among the items make them good candidates for PCA and factor analysis. After rearranging the correlation plot, three highly correlated clusters of items emerge, indicating that the items' interrelationships can be broken down into three components. A sketch of the corresponding calls follows.
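This is a hedged sketch of the correlation plot and the PCA; corrplot and prcomp are assumptions about the functions used, and summary() reproduces the importance table below.

```r
library(corrplot)

corr <- cor(items)
corrplot(corr, order = "hclust")   # rearranged plot; three correlated clusters emerge

pca <- prcomp(items, center = TRUE, scale. = TRUE)
summary(pca)    # importance-of-components table below
pca$sdev^2      # eigenvalues; the first three exceed 1 (Kaiser criterion)
```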

## Importance of components:
##                           PC1    PC2    PC3     PC4    PC5     PC6     PC7
## Standard deviation     1.9609 1.7223 1.3455 0.91858 0.9042 0.86558 0.82802
## Proportion of Variance 0.2403 0.1854 0.1131 0.05274 0.0511 0.04683 0.04285
## Cumulative Proportion  0.2403 0.4257 0.5389 0.59159 0.6427 0.68952 0.73237
##                            PC8     PC9    PC10    PC11   PC12    PC13
## Standard deviation     0.80763 0.77198 0.75643 0.72906 0.7178 0.65666
## Proportion of Variance 0.04077 0.03725 0.03576 0.03322 0.0322 0.02695
## Cumulative Proportion  0.77313 0.81038 0.84614 0.87936 0.9116 0.93851
##                           PC14    PC15    PC16
## Standard deviation     0.60718 0.58416 0.52337
## Proportion of Variance 0.02304 0.02133 0.01712
## Cumulative Proportion  0.96155 0.98288 1.00000

Sixteen principal components were obtained, each explaining a percentage of the total variation in the dataset. The PCA output table and bar chart show that PC1 explains 24% of the total variance, PC2 explains 18.5%, and PC3 explains 11.3%; together these three components explain about 54% of the variance. This is not very high, but in social science research, where information is often less precise, extracted factors usually explain only 50% to 60% of the variance, so a solution accounting for roughly 54% is commonly considered satisfactory (Hair, 2006).

The cluster dendrogram illustrates which items are similar to one another. For example, all teacher items cluster together at the top, which makes sense because all of these items measure students' perception of teacher support.

Another criterion is to retain components with eigenvalues greater than 1. The scree plot confirms that the first three components have eigenvalues greater than 1, so the three-component solution is plausible for this study.

In factor analysis, the unobserved (latent) variable that accounts for common variance is called a factor. The factor analysis plot demonstrates three very clear underlying structures among these 16 items: all parent items load on factor 1, all teacher items on factor 2, and all peer items on factor 3. Accordingly, factor 1 was named parental involvement, factor 2 students' perception of their teachers, and factor 3 peer influence.

From the factor loading table, all observed item loadings are larger than .5; a high factor loading indicates high convergent validity.
In the last step, factor scores for these three latent factors were generated for the subsequent analyses, as sketched below.
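A minimal sketch of the factor extraction, assuming base R's factanal() with varimax rotation; the regression scores feed the later models.

```r
fa <- factanal(items, factors = 3, rotation = "varimax", scores = "regression")
print(fa$loadings, cutoff = 0.5)   # all items load above .5 on their own factor

# parental involvement, teacher perception, and peer influence scores
factor_scores <- as.data.frame(fa$scores)
```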

SMART Question 1: What factors determine whether students go to college or not?

One of the most significant and challenging decisions you will ever make is whether or not to go to college (McKay, 2019). This decision can be influenced in many ways: some students go to college because their parents tell them to, or because all their friends do; others do not go because they cannot afford it or are not admitted. Here we try to find which factors influence a student's decision to go to college, and which are the most important.

  • Data cleaning

To build the model, we selected 25 variables from the raw data as independent variables and one categorical variable as the dependent variable. All missing values were replaced with NA, and all rows containing NA were removed. The dependent variable has multiple levels giving the reasons students are not enrolled in college, so we coded all of these reasons as 0 (not enrolled) and enrollment as 1.
We then converted the categorical variables to factors and renamed the columns for readability, as sketched below. The figure below shows the final data structure: 8 numeric variables and 18 factor variables in total.
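A hedged sketch of these cleaning steps; the raw object raw_q1, the negative missing-value codes, and the enrollment coding are assumptions, but str() reproduces the structure shown below.

```r
library(dplyr)

df_clean <- raw_q1 %>%                                       # hypothetical 26-column subset
  mutate(across(everything(), ~ replace(., . < 0, NA))) %>%  # assumed survey missing codes -> NA
  na.omit() %>%                                              # drop incomplete rows
  # collapse the multi-level "reason not enrolled" codes to 0, enrollment to 1 (coding assumed)
  mutate(Attendcollege = factor(if_else(Attendcollege == 1, "1", "0"),
                                levels = c("0", "1"))) %>%
  mutate(across(c(Race, NativeLanguage, Mom_education, Mom_occupation,
                  Dad_education, Dad_occupation, Sex), as.factor))  # and the other categorical columns

str(df_clean)
```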

## Classes 'tbl_df', 'tbl' and 'data.frame':    2297 obs. of  26 variables:
##  $ Mathscore_grade9          : num  59.4 47.7 64.2 66.7 56.2 ...
##  $ Mathscore_grade11         : num  68.6 54.1 55.6 64.7 53.6 ...
##  $ Race                      : Factor w/ 8 levels "1","2","3","4",..: 8 8 3 8 8 8 8 2 8 6 ...
##  $ NativeLanguage            : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 2 1 1 ...
##  $ Quintile_math_score_grade9: Factor w/ 5 levels "1","2","3","4",..: 5 2 5 5 4 5 1 4 5 5 ...
##  $ Mom_education             : Factor w/ 7 levels "0","1","2","3",..: 6 4 7 7 6 5 3 5 4 4 ...
##  $ Mom_employment_status     : Factor w/ 4 levels "0","2","3","4": 4 3 4 4 3 4 4 4 4 4 ...
##  $ Mom_occupation            : Factor w/ 24 levels "0","11","13",..: 11 18 7 11 18 17 12 3 9 7 ...
##  $ Mom_race                  : Factor w/ 8 levels "0","2","3","4",..: 7 5 3 7 7 7 7 2 7 7 ...
##  $ Dad_education             : Factor w/ 7 levels "0","1","2","3",..: 6 3 1 7 4 7 1 5 3 4 ...
##  $ Dad_employment_status     : Factor w/ 4 levels "0","2","3","4": 4 2 1 4 4 4 1 4 4 4 ...
##  $ Dad_occupation            : Factor w/ 24 levels "0","11","13",..: 7 22 1 11 17 8 1 18 23 21 ...
##  $ Dad_race                  : Factor w/ 8 levels "0","2","3","4",..: 7 7 1 7 7 7 1 5 7 6 ...
##  $ Social_economic_status    : num  1.56 -0.37 1.27 2.57 0.14 ...
##  $ Math_teacher_race         : Factor w/ 7 levels "2","3","4","5",..: 6 6 6 6 6 6 6 2 6 6 ...
##  $ Math_teacher's_certificate: Factor w/ 5 levels "0","1","2","3",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ School_Private/Public     : Factor w/ 2 levels "1","2": 1 1 1 1 1 2 1 2 1 1 ...
##  $ School_Urban/City         : Factor w/ 4 levels "1","2","3","4": 4 4 2 2 2 1 1 2 4 4 ...
##  $ School_State              : Factor w/ 4 levels "1","2","3","4": 2 1 4 3 2 3 1 1 4 3 ...
##  $ Sex                       : Factor w/ 2 levels "1","2": 1 2 2 1 1 1 2 2 1 1 ...
##  $ STEM_grade12_GPA          : num  3 4 2.5 3 3 3 1.5 2.5 4 3 ...
##  $ Math_grade12_GPA          : num  3 4 2.5 3 2.5 3 2 2.5 3.5 3.5 ...
##  $ PC1_parents_involvement   : num  -1.161 0.952 0.88 0.885 -0.207 ...
##  $ PC2_teahers_involvement   : num  -0.349 -0.416 1.025 -0.439 -0.401 ...
##  $ PC3_peer_influence        : num  -0.173 -1.142 0.987 1.457 0.567 ...
##  $ Attendcollege             : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 2 ...
##  - attr(*, "na.action")= 'omit' Named int  4 5 6 7 8 9 10 11 12 14 ...
##   ..- attr(*, "names")= chr  "4" "5" "6" "7" ...
  • Descriptive analysis

We first take an overall look at the dependent variable Attendcollege. The plot shows that students in our dataset attend college about twice as often as not.
We then draw a correlation plot of all 26 variables; the darker the point, the higher the correlation. It reveals multicollinearity between some of our variables.

Lasso regression can deal with the multicollinearity issue, so we use the Lasso for variable selection.

  • Lasso (least absolute shrinkage and selection operator)

First, let's review the Lasso objective, written out below. Lambda is the tuning parameter, and the term involving lambda is the shrinkage penalty: as lambda increases, the coefficients beta are shrunk toward exactly 0. This is how the Lasso performs feature selection.
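For reference, the penalized objective for the binomial case (approximately as documented for glmnet) is

$$\hat{\beta} = \arg\min_{\beta_0,\,\beta} \left\{ -\frac{1}{N} \sum_{i=1}^{N} \log L\!\left(y_i,\; \beta_0 + x_i^{\top}\beta\right) + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}.$$

As $\lambda \to 0$ the fit approaches the unpenalized logistic regression; as $\lambda$ grows, more coefficients are forced to exactly zero.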

We run the Lasso regression and obtain two plots. The first is the coefficient path plot, which shows that more and more coefficients of the independent variables are shrunk to zero as log(lambda) increases.

The second plot is the cross-validation plot, which shows binomial deviance against log(lambda); the lower the deviance, the better. The left dashed line marks the minimum-deviance point (lambda of about 0.01), and the right dashed line marks the largest lambda within one standard error of the minimum (lambda of about 0.04). Any value between the dashed lines is commonly considered a good choice. Since we do not want too many variables in our model, we use the right line to select the variables. The axis along the top indicates that six variables remain at this point; all other coefficients have been shrunk to 0.
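The fit and cross-validation can be sketched as follows (glmnet is an assumption about the package used; lambda.1se corresponds to the right dashed line):

```r
library(glmnet)

x <- model.matrix(Attendcollege ~ ., data = df_clean)[, -1]  # dummy-code factors, drop intercept
y <- df_clean$Attendcollege

fit <- glmnet(x, y, family = "binomial", alpha = 1)  # alpha = 1 gives the Lasso
plot(fit, xvar = "lambda")                           # coefficient path plot

cv <- cv.glmnet(x, y, family = "binomial")
plot(cv)                                             # binomial deviance vs log(lambda)

coef(cv, s = "lambda.1se")                           # six nonzero coefficients remain
```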

Selecting the variables whose coefficients are not zero, we obtain the following six variables:

## [1] "Mathscore_grade11"      "Mom_education4"        
## [3] "Social_economic_status" "Sex2"                  
## [5] "STEM_grade12_GPA"       "PC3_peer_influence"

Because the Lasso shrinks coefficients, they are not the original estimates; we therefore fit a logistic regression to obtain the actual coefficients for the six variables.

  • Logistic regression
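The model below is reconstructed from the Call shown in the output; exp(coef()) yields the odds ratios printed after the summary.

```r
model1 <- glm(Attendcollege ~ Mathscore_grade11 + Mom_education +
                Social_economic_status + Sex + STEM_grade12_GPA +
                PC3_peer_influence,
              family = "binomial", data = df_clean)
summary(model1)
exp(coef(model1))   # odds ratios
```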

## 
## Call:
## glm(formula = Attendcollege ~ Mathscore_grade11 + Mom_education + 
##     Social_economic_status + Sex + STEM_grade12_GPA + PC3_peer_influence, 
##     family = "binomial", data = df_clean)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6882  -0.7089   0.3940   0.7010   2.4061  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -3.868150   0.418919  -9.234  < 2e-16 ***
## Mathscore_grade11       0.017869   0.007351   2.431   0.0151 *  
## Mom_education1          0.369377   0.344419   1.072   0.2835    
## Mom_education2          0.438897   0.238247   1.842   0.0654 .  
## Mom_education3          0.537587   0.255733   2.102   0.0355 *  
## Mom_education4          1.125367   0.261934   4.296 1.74e-05 ***
## Mom_education5          0.616141   0.307092   2.006   0.0448 *  
## Mom_education7          0.968663   0.474791   2.040   0.0413 *  
## Social_economic_status  0.336600   0.106663   3.156   0.0016 ** 
## Sex2                    0.564518   0.111004   5.086 3.67e-07 ***
## STEM_grade12_GPA        1.007335   0.086044  11.707  < 2e-16 ***
## PC3_peer_influence      0.453796   0.063401   7.158 8.21e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2847.2  on 2296  degrees of freedom
## Residual deviance: 2105.2  on 2285  degrees of freedom
## AIC: 2129.2
## 
## Number of Fisher Scoring iterations: 5
##            (Intercept)      Mathscore_grade11         Mom_education1 
##             0.02089699             1.01803007             1.44683349 
##         Mom_education2         Mom_education3         Mom_education4 
##             1.55099614             1.71187132             3.08134779 
##         Mom_education5         Mom_education7 Social_economic_status 
##             1.85176768             2.63442070             1.40017858 
##                   Sex2       STEM_grade12_GPA     PC3_peer_influence 
##             1.75859927             2.73829499             1.57427676

The output shows that most variables are significant. Taking the exponential of the coefficients gives odds ratios: for every one-unit increase in STEM GPA, the odds of attending college are multiplied by 2.74, and when the mother's education is level 4, the odds of attending college are multiplied by 3.08. We conclude that STEM_grade12_GPA and Mom_education matter more than the other variables for this prediction.

  • Model evaluation

We use the ROC curve for model evaluation, as sketched below; the area under the curve is about 0.83, which indicates that our six-variable model performs very well.
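A minimal sketch of the ROC computation, assuming the pROC package:

```r
library(pROC)

prob <- predict(model1, type = "response")
roc_obj <- roc(response = df_clean$Attendcollege, predictor = prob)
plot(roc_obj)
auc(roc_obj)   # about 0.83
```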

Below, we also fit the same six-variable model with a random forest and a decision tree, and compare the three sets of results.

  • Random Forest Model

Random Forest is one of the most widely used machine learning algorithms for classification. One of its strengths is that very few assumptions are attached to it, so data preparation is less challenging and time is saved (Bhalla, n.d.).
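The fit below is reconstructed from the Call in the output; set.seed() is an assumption added for reproducibility.

```r
library(randomForest)
set.seed(2019)   # hypothetical seed

rf_model <- randomForest(Attendcollege ~ Mathscore_grade11 + Mom_education +
                           Social_economic_status + Sex + STEM_grade12_GPA +
                           PC3_peer_influence,
                         data = df_clean, importance = TRUE)
rf_model              # OOB error and confusion matrix below
varImpPlot(rf_model)  # mean decrease accuracy / mean decrease Gini
```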
## 
## Call:
##  randomForest(formula = Attendcollege ~ Mathscore_grade11 + Mom_education +      Social_economic_status + Sex + STEM_grade12_GPA + PC3_peer_influence,      data = df_clean, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 22.73%
## Confusion matrix:
##     0    1 class.error
## 0 369  345   0.4831933
## 1 177 1406   0.1118130

We use the default of 500 trees, and since we have only 6 variables, the number tried at each split is 2. The out-of-bag (OOB) data serve as test data: each tree is tested on the roughly one-third of samples not used in building that tree, so the OOB error estimates the model's accuracy. The accuracy rate is about 77%. The confusion matrix shows that the class-0 error is somewhat higher, probably because class 0 contains fewer than half as many samples as class 1.

Mean decrease in accuracy measures how much model accuracy drops if a variable is removed; the higher the value, the more important the variable in the model (Bhalla, n.d.). In the plot shown above, STEM GPA is the most essential variable, and PC3 peer influence is second.

  • Classification tree

The decision tree is a visual method that is easy to understand. We usually grow a tree first, then optimize (prune) it, and finally check accuracy on test data, as sketched below. Here we again use the six variables to build the tree. To optimize the tree and find the important factors, we use the complexity parameter (cp): we plot the cross-validated error and choose the cp with the lowest x-error.
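A hedged sketch of the tree workflow, assuming rpart and an 80/20 train/test split (implied by the size of the test confusion matrix reported below):

```r
library(rpart)
library(rpart.plot)
set.seed(2019)   # hypothetical seed

idx   <- sample(nrow(df_clean), 0.8 * nrow(df_clean))
train <- df_clean[idx, ]
test  <- df_clean[-idx, ]

tree <- rpart(Attendcollege ~ Mathscore_grade11 + Mom_education +
                Social_economic_status + Sex + STEM_grade12_GPA +
                PC3_peer_influence,
              data = train, method = "class")
plotcp(tree)   # cross-validated error against cp

best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp)
rpart.plot(pruned)

pred <- predict(pruned, test, type = "class")
table(real = test$Attendcollege, predict = pred)   # confusion matrix below
```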

The pruned tree shows that students with a STEM GPA above 2.25 go to college. If the GPA is below 2.25, we look at the mother's education: if the mother has a higher education level, the student still goes to college; if not, the student does not.

##     predict
## real   0   1
##    0  73  81
##    1  25 280

We then predict on the 20% test data. The confusion matrix shows that the class-0 error is again somewhat higher, as it was in both the original and pruned trees; the reason is the imbalance between the two classes. Nevertheless, the overall accuracy of the model is good, about 76.91% ((73 + 280) / 459).

  • Comparison

Finally, we compare the three methods. Logistic regression has the highest accuracy; the random forest and decision tree are about the same, slightly lower than logistic regression. As for important variables, STEM GPA ranks first in all three methods; mother's education is second in logistic regression and the decision tree, while in the random forest peer influence is more important than mother's education.

SMART Question 2: What factors are responsible for students' income when they start working?

In this research question, we seek to find the factors that have an impact on students' future earnings. We use almost the same set of variables to predict income.

  • Data cleaning

The following is the data structure of the subset for income prediction. The table contains INCOME_CAT as the binomial Y variable, classified into two levels: level 1 for income below $20,000 and level 2 for income above $20,000 (the dichotomization is sketched below). The subset uses 16 independent variables of mixed data types to predict students' income when they start working.
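A minimal sketch of the dichotomization; the raw income column name is hypothetical.

```r
library(dplyr)

df_subset <- df_subset %>%
  mutate(INCOME_CAT = factor(if_else(earnings_start < 20000, 1, 2)))  # earnings_start is a hypothetical raw column

str(df_subset)
```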

## Classes 'tbl_df', 'tbl' and 'data.frame':    3686 obs. of  17 variables:
##  $ MATHscore_9th_grade     : num  59.4 47.7 64.2 66.7 53.9 ...
##  $ LOCAL                   : Factor w/ 4 levels "1","2","3","4": 4 4 2 1 1 2 1 1 2 2 ...
##  $ REGION                  : Factor w/ 4 levels "1","2","3","4": 2 1 4 3 2 2 3 1 1 3 ...
##  $ NativeLanguage          : Factor w/ 3 levels "1","2","3": 1 1 1 1 2 1 1 1 2 1 ...
##  $ MOM_education           : Factor w/ 8 levels "0","1","2","3",..: 7 5 8 8 3 7 6 4 6 1 ...
##  $ Dad_education           : Factor w/ 8 levels "0","1","2","3",..: 7 3 1 8 3 5 8 1 6 2 ...
##  $ Family_Income           : Factor w/ 13 levels "1","2","3","4",..: 11 3 6 13 3 3 13 1 6 5 ...
##  $ SEX                     : Factor w/ 2 levels "1","2": 1 2 2 1 1 1 1 2 2 1 ...
##  $ RACE                    : Factor w/ 8 levels "1","2","3","4",..: 8 8 3 8 5 8 8 8 2 5 ...
##  $ AllAcademic_grade_12_GPA: Factor w/ 9 levels "0.25","0.5","1",..: 8 9 6 8 6 7 8 4 7 6 ...
##  $ INCOME_CAT              : Factor w/ 2 levels "1","2": 1 2 1 1 1 1 1 2 1 2 ...
##  $ SES                     : num  1.565 -0.331 1.014 2.151 -0.666 ...
##  $ Math_grade_12_GPA       : num  3 4 2.5 3 3 2.5 3 2 2.5 2 ...
##  $ STEM_grade_12_GPA       : num  3 4 2.5 3 3 3 3 1.5 2.5 2.5 ...
##  $ PC1                     : num  -1.161 0.952 0.88 0.885 0.443 ...
##  $ PC2                     : num  -0.349 -0.416 1.025 -0.439 1.424 ...
##  $ PC3                     : num  -0.173 -1.142 0.987 1.457 -0.123 ...
##  - attr(*, "na.action")= 'omit' Named int  4 5 6 7 8 9 10 11 12 14 ...
##   ..- attr(*, "names")= chr  "4" "5" "6" "7" ...

Logistic regression is a predictive analysis that describes data and explains the relationship between one binary dependent variable and one or more nominal, ordinal, interval, or ratio-level independent variables. Its output is the probability that a given input point belongs to a certain class.

  • Logistic Regression

## 
## Call:
## glm(formula = INCOME_CAT ~ ., family = "binomial", data = df_subset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4949  -0.5395  -0.3761  -0.2335   3.2144  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -2.806783   1.225801  -2.290 0.022036 *  
## MATHscore_9th_grade         -0.002826   0.007121  -0.397 0.691499    
## LOCAL2                      -0.067045   0.148685  -0.451 0.652049    
## LOCAL3                       0.204530   0.186597   1.096 0.273033    
## LOCAL4                       0.233881   0.144558   1.618 0.105682    
## REGION2                      0.647008   0.187705   3.447 0.000567 ***
## REGION3                      0.376207   0.184257   2.042 0.041176 *  
## REGION4                      0.552139   0.207760   2.658 0.007870 ** 
## NativeLanguage2              0.348739   0.229480   1.520 0.128589    
## NativeLanguage3              0.093240   0.264137   0.353 0.724089    
## MOM_education1              -0.370150   0.306443  -1.208 0.227089    
## MOM_education2              -0.231991   0.207342  -1.119 0.263192    
## MOM_education3              -0.518691   0.320526  -1.618 0.105609    
## MOM_education4               0.071092   0.231355   0.307 0.758627    
## MOM_education5              -0.566987   0.252756  -2.243 0.024883 *  
## MOM_education6               0.074107   0.300139   0.247 0.804979    
## MOM_education7              -1.020299   0.586340  -1.740 0.081839 .  
## Dad_education1               0.037701   0.244837   0.154 0.877624    
## Dad_education2              -0.010837   0.156121  -0.069 0.944662    
## Dad_education3               0.353931   0.292135   1.212 0.225691    
## Dad_education4              -0.032605   0.211498  -0.154 0.877482    
## Dad_education5              -0.330283   0.219436  -1.505 0.132288    
## Dad_education6              -0.461266   0.316577  -1.457 0.145105    
## Dad_education7              -0.344376   0.413824  -0.832 0.405307    
## Family_Income2               0.165689   0.242041   0.685 0.493629    
## Family_Income3               0.418606   0.253977   1.648 0.099311 .  
## Family_Income4               0.477840   0.277396   1.723 0.084962 .  
## Family_Income5               0.564653   0.295850   1.909 0.056317 .  
## Family_Income6               0.272920   0.325977   0.837 0.402458    
## Family_Income7               0.098888   0.379406   0.261 0.794370    
## Family_Income8               0.509535   0.385103   1.323 0.185797    
## Family_Income9              -0.331325   0.554091  -0.598 0.549866    
## Family_Income10             -0.195032   0.675199  -0.289 0.772695    
## Family_Income11             -0.589179   0.669253  -0.880 0.378668    
## Family_Income12              0.085085   0.800500   0.106 0.915352    
## Family_Income13              0.226632   0.438714   0.517 0.605448    
## SEX2                        -0.722050   0.116426  -6.202 5.58e-10 ***
## RACE2                        0.343202   1.083078   0.317 0.751337    
## RACE3                        0.799271   1.060324   0.754 0.450970    
## RACE4                       -0.131906   1.493977  -0.088 0.929645    
## RACE5                        1.171529   1.054562   1.111 0.266605    
## RACE6                        1.210711   1.058832   1.143 0.252856    
## RACE7                        0.743480   1.508333   0.493 0.622073    
## RACE8                        1.397161   1.047789   1.333 0.182389    
## AllAcademic_grade_12_GPA1   -1.020901   0.583951  -1.748 0.080418 .  
## AllAcademic_grade_12_GPA1.5 -0.430714   0.494873  -0.870 0.384108    
## AllAcademic_grade_12_GPA2   -0.586456   0.511773  -1.146 0.251824    
## AllAcademic_grade_12_GPA2.5 -0.591727   0.545147  -1.085 0.277725    
## AllAcademic_grade_12_GPA3   -0.869363   0.596297  -1.458 0.144858    
## AllAcademic_grade_12_GPA3.5 -1.378707   0.662552  -2.081 0.037443 *  
## AllAcademic_grade_12_GPA4   -1.622007   0.750957  -2.160 0.030779 *  
## SES                         -0.176304   0.178321  -0.989 0.322815    
## Math_grade_12_GPA            0.332072   0.162842   2.039 0.041427 *  
## STEM_grade_12_GPA           -0.154182   0.208395  -0.740 0.459389    
## PC1                         -0.086251   0.057415  -1.502 0.133039    
## PC2                         -0.116402   0.052626  -2.212 0.026976 *  
## PC3                         -0.247642   0.059499  -4.162 3.15e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2707.7  on 3685  degrees of freedom
## Residual deviance: 2379.0  on 3629  degrees of freedom
## AIC: 2493
## 
## Number of Fisher Scoring iterations: 6

The summary of the logistic regression above shows many insignificant variables when all the variables are used as regressors. We therefore perform model selection to find the model with the lowest AIC.

  • Model selection (backward stepwise selection)
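A sketch of the selection call, assuming base R's step() applied to the full glm fitted above; the trace below shows each elimination step.

```r
full_model <- glm(INCOME_CAT ~ ., family = "binomial", data = df_subset)
step_model <- step(full_model, direction = "backward")  # drops terms while AIC improves
```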

## Start:  AIC=2493.01
## INCOME_CAT ~ MATHscore_9th_grade + LOCAL + REGION + NativeLanguage + 
##     MOM_education + Dad_education + Family_Income + SEX + RACE + 
##     AllAcademic_grade_12_GPA + SES + Math_grade_12_GPA + STEM_grade_12_GPA + 
##     PC1 + PC2 + PC3
## 
##                            Df Deviance    AIC
## - Family_Income            12   2394.7 2484.7
## - Dad_education             7   2385.3 2485.3
## - MATHscore_9th_grade       1   2379.2 2491.2
## - NativeLanguage            2   2381.3 2491.3
## - STEM_grade_12_GPA         1   2379.6 2491.6
## - SES                       1   2380.0 2492.0
## - LOCAL                     3   2384.7 2492.7
## <none>                          2379.0 2493.0
## - PC1                       1   2381.2 2493.2
## - Math_grade_12_GPA         1   2383.2 2495.2
## - AllAcademic_grade_12_GPA  7   2395.7 2495.7
## - PC2                       1   2383.9 2495.9
## - MOM_education             7   2399.8 2499.8
## - REGION                    3   2393.0 2501.0
## - RACE                      7   2404.1 2504.1
## - PC3                       1   2396.2 2508.2
## - SEX                       1   2418.6 2530.6
## 
## Step:  AIC=2484.65
## INCOME_CAT ~ MATHscore_9th_grade + LOCAL + REGION + NativeLanguage + 
##     MOM_education + Dad_education + SEX + RACE + AllAcademic_grade_12_GPA + 
##     SES + Math_grade_12_GPA + STEM_grade_12_GPA + PC1 + PC2 + 
##     PC3
## 
##                            Df Deviance    AIC
## - Dad_education             7   2405.6 2481.6
## - NativeLanguage            2   2396.6 2482.6
## - MATHscore_9th_grade       1   2394.8 2482.8
## - SES                       1   2395.0 2483.0
## - STEM_grade_12_GPA         1   2395.2 2483.2
## <none>                          2394.7 2484.7
## - LOCAL                     3   2400.7 2484.7
## - PC1                       1   2396.7 2484.7
## - Math_grade_12_GPA         1   2398.8 2486.8
## - PC2                       1   2399.7 2487.7
## - AllAcademic_grade_12_GPA  7   2413.0 2489.0
## - REGION                    3   2409.5 2493.5
## - MOM_education             7   2418.4 2494.4
## - RACE                      7   2419.9 2495.9
## - PC3                       1   2412.3 2500.3
## - SEX                       1   2434.3 2522.3
## 
## Step:  AIC=2481.62
## INCOME_CAT ~ MATHscore_9th_grade + LOCAL + REGION + NativeLanguage + 
##     MOM_education + SEX + RACE + AllAcademic_grade_12_GPA + SES + 
##     Math_grade_12_GPA + STEM_grade_12_GPA + PC1 + PC2 + PC3
## 
##                            Df Deviance    AIC
## - NativeLanguage            2   2407.5 2479.5
## - MATHscore_9th_grade       1   2405.9 2479.9
## - STEM_grade_12_GPA         1   2406.1 2480.1
## <none>                          2405.6 2481.6
## - PC1                       1   2407.7 2481.7
## - LOCAL                     3   2413.3 2483.3
## - Math_grade_12_GPA         1   2409.7 2483.7
## - PC2                       1   2410.9 2484.9
## - SES                       1   2411.3 2485.3
## - AllAcademic_grade_12_GPA  7   2426.2 2488.2
## - REGION                    3   2421.6 2491.6
## - MOM_education             7   2432.0 2494.0
## - RACE                      7   2432.4 2494.4
## - PC3                       1   2425.1 2499.1
## - SEX                       1   2444.0 2518.0
## 
## Step:  AIC=2479.49
## INCOME_CAT ~ MATHscore_9th_grade + LOCAL + REGION + MOM_education + 
##     SEX + RACE + AllAcademic_grade_12_GPA + SES + Math_grade_12_GPA + 
##     STEM_grade_12_GPA + PC1 + PC2 + PC3
## 
##                            Df Deviance    AIC
## - MATHscore_9th_grade       1   2407.8 2477.8
## - STEM_grade_12_GPA         1   2407.9 2477.9
## <none>                          2407.5 2479.5
## - PC1                       1   2409.8 2479.8
## - LOCAL                     3   2414.9 2480.9
## - Math_grade_12_GPA         1   2411.7 2481.7
## - PC2                       1   2412.5 2482.5
## - SES                       1   2413.7 2483.7
## - AllAcademic_grade_12_GPA  7   2427.9 2485.9
## - REGION                    3   2423.3 2489.3
## - MOM_education             7   2432.8 2490.8
## - RACE                      7   2433.1 2491.1
## - PC3                       1   2426.8 2496.8
## - SEX                       1   2446.3 2516.3
## 
## Step:  AIC=2477.79
## INCOME_CAT ~ LOCAL + REGION + MOM_education + SEX + RACE + AllAcademic_grade_12_GPA + 
##     SES + Math_grade_12_GPA + STEM_grade_12_GPA + PC1 + PC2 + 
##     PC3
## 
##                            Df Deviance    AIC
## - STEM_grade_12_GPA         1   2408.3 2476.3
## <none>                          2407.8 2477.8
## - PC1                       1   2410.2 2478.2
## - LOCAL                     3   2415.5 2479.5
## - Math_grade_12_GPA         1   2411.9 2479.9
## - PC2                       1   2412.8 2480.8
## - SES                       1   2414.2 2482.2
## - AllAcademic_grade_12_GPA  7   2428.9 2484.9
## - REGION                    3   2423.7 2487.7
## - MOM_education             7   2433.3 2489.3
## - RACE                      7   2433.3 2489.3
## - PC3                       1   2428.5 2496.5
## - SEX                       1   2446.5 2514.5
## 
## Step:  AIC=2476.29
## INCOME_CAT ~ LOCAL + REGION + MOM_education + SEX + RACE + AllAcademic_grade_12_GPA + 
##     SES + Math_grade_12_GPA + PC1 + PC2 + PC3
## 
##                            Df Deviance    AIC
## <none>                          2408.3 2476.3
## - PC1                       1   2410.7 2476.7
## - LOCAL                     3   2415.9 2477.9
## - Math_grade_12_GPA         1   2412.4 2478.4
## - PC2                       1   2413.4 2479.4
## - SES                       1   2414.8 2480.8
## - REGION                    3   2424.1 2486.1
## - RACE                      7   2433.7 2487.7
## - MOM_education             7   2433.7 2487.7
## - AllAcademic_grade_12_GPA  7   2434.3 2488.3
## - PC3                       1   2429.2 2495.2
## - SEX                       1   2446.5 2512.5

This backward stepwise selection identifies the variables that account for significant variance in the dependent variable. At the sixth step we obtain the model with the lowest AIC among those compared, containing the variables that actually explain significant variance in INCOME_CAT. We now call glm() again on this model to obtain further evaluations.

## 
## Call:
## glm(formula = INCOME_CAT ~ REGION + MOM_education + SEX + RACE + 
##     AllAcademic_grade_12_GPA + SES + Math_grade_12_GPA + PC1 + 
##     PC2 + PC3, family = "binomial", data = df_subset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5037  -0.5435  -0.3922  -0.2526   3.0459  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -2.72575    1.15175  -2.367 0.017952 *  
## REGION2                      0.70277    0.18581   3.782 0.000155 ***
## REGION3                      0.43798    0.18093   2.421 0.015490 *  
## REGION4                      0.56778    0.20384   2.785 0.005345 ** 
## MOM_education1              -0.37054    0.29109  -1.273 0.203031    
## MOM_education2              -0.19637    0.20385  -0.963 0.335389    
## MOM_education3              -0.49428    0.31656  -1.561 0.118426    
## MOM_education4               0.17185    0.22597   0.760 0.446966    
## MOM_education5              -0.56644    0.24655  -2.297 0.021594 *  
## MOM_education6               0.07335    0.28782   0.255 0.798851    
## MOM_education7              -1.08682    0.57159  -1.901 0.057249 .  
## SEX2                        -0.69412    0.11283  -6.152 7.65e-10 ***
## RACE2                        0.33344    1.07766   0.309 0.757011    
## RACE3                        0.68826    1.05680   0.651 0.514871    
## RACE4                       -0.07832    1.48460  -0.053 0.957927    
## RACE5                        1.17858    1.04979   1.123 0.261573    
## RACE6                        1.13774    1.05536   1.078 0.281008    
## RACE7                        0.68094    1.49803   0.455 0.649427    
## RACE8                        1.32923    1.04415   1.273 0.203009    
## AllAcademic_grade_12_GPA1   -1.06913    0.57524  -1.859 0.063087 .  
## AllAcademic_grade_12_GPA1.5 -0.42397    0.47916  -0.885 0.376256    
## AllAcademic_grade_12_GPA2   -0.63153    0.48365  -1.306 0.191640    
## AllAcademic_grade_12_GPA2.5 -0.66165    0.50228  -1.317 0.187740    
## AllAcademic_grade_12_GPA3   -0.99812    0.53726  -1.858 0.063199 .  
## AllAcademic_grade_12_GPA3.5 -1.56670    0.58539  -2.676 0.007444 ** 
## AllAcademic_grade_12_GPA4   -1.86387    0.65473  -2.847 0.004416 ** 
## SES                         -0.29372    0.11096  -2.647 0.008118 ** 
## Math_grade_12_GPA            0.24873    0.12520   1.987 0.046955 *  
## PC1                         -0.09123    0.05677  -1.607 0.108045    
## PC2                         -0.12087    0.05209  -2.320 0.020324 *  
## PC3                         -0.27706    0.05772  -4.800 1.58e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2707.7  on 3685  degrees of freedom
## Residual deviance: 2415.9  on 3655  degrees of freedom
## AIC: 2477.9
## 
## Number of Fisher Scoring iterations: 6

The exponentiated coefficients can also be checked as odds ratios. A regression coefficient describes the size and direction of the relationship between a predictor and the response variable.

##                 (Intercept)                     REGION2 
##                  0.06549724                  2.01933788 
##                     REGION3                     REGION4 
##                  1.54956960                  1.76435156 
##              MOM_education1              MOM_education2 
##                  0.69035841                  0.82170984 
##              MOM_education3              MOM_education4 
##                  0.61000860                  1.18749402 
##              MOM_education5              MOM_education6 
##                  0.56754474                  1.07610410 
##              MOM_education7                        SEX2 
##                  0.33728573                  0.49951501 
##                       RACE2                       RACE3 
##                  1.39575914                  1.99025296 
##                       RACE4                       RACE5 
##                  0.92466775                  3.24976124 
##                       RACE6                       RACE7 
##                  3.11969926                  1.97573735 
##                       RACE8   AllAcademic_grade_12_GPA1 
##                  3.77814975                  0.34330642 
## AllAcademic_grade_12_GPA1.5   AllAcademic_grade_12_GPA2 
##                  0.65444655                  0.53177823 
## AllAcademic_grade_12_GPA2.5   AllAcademic_grade_12_GPA3 
##                  0.51600027                  0.36857285 
## AllAcademic_grade_12_GPA3.5   AllAcademic_grade_12_GPA4 
##                  0.20873367                  0.15507105 
##                         SES           Math_grade_12_GPA 
##                  0.74548600                  1.28239996 
##                         PC1                         PC2 
##                  0.91280481                  0.88614731 
##                         PC3 
##                  0.75801263

We can see that the exponentiated coefficients for the categorical variables REGION2, REGION3, and REGION4 are high relative to the other variables, so the chances of higher or lower income when students start working can be distinguished by region. Students in REGION 2 have higher odds of a high salary than students in REGION 4.

  • Model evaluation: McFadden
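The values below are consistent with pscl::pR2(); a sketch follows (the package choice is an assumption, and step_model is the stepwise-selected glm from above).

```r
library(pscl)
pR2(step_model)   # llh, llhNull, G2, McFadden, r2ML, r2CU
```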

##           llh       llhNull            G2      McFadden          r2ML 
## -1.207960e+03 -1.353840e+03  2.917603e+02  1.077529e-01  7.610203e-02 
##          r2CU 
##  1.462670e-01

The pseudo-R² plays a role similar to the multiple R² in linear regression, but for maximum likelihood estimation. Here the McFadden value comes out to about 0.11 (roughly 10%), which is not bad for social science survey data about people.

  • Model evaluation: ROC curve and Area-Under-Curve

The ROC curve and the area under it (AUC) are important metrics for evaluating any classification model's performance; ROC stands for Receiver Operating Characteristic.

The AUC comes out to 0.738, which approaches 0.8 and is good for our kind of social science dataset. We accept this as an adequate value for model validation and for predicting the probability of a student's income category from the selected predictors.

Conclusion

In conclusion, PCA revealed three underlying structures: parental involvement, students' perception of teacher support, and peer influence. We included these three factors in the analyses of college attendance and graduate income. The results indicate that the essential factors influencing whether a student attends college are the student's math score, mother's education, family socioeconomic status, gender, STEM GPA, and peer influence. The essential factors influencing a student's income upon starting work are region, mother's education, gender, socioeconomic status, final high school GPA, and the teacher and peer factors. Comparing the important factors common to the two research questions, we conclude that family socioeconomic status, mother's education level, and peer influence are crucial in a student's life.

Limitation

There are some limitations to our project. Our data are cross-sectional, so causal claims are hard to support and the direction of influence is difficult to determine. The data are also self-reported, so responses may contain social desirability bias. In addition, some data are not publicly accessible; if we could obtain such data from the responsible organization, our analysis could be improved.

Bibliography

Bhalla, D. (n.d.). Random Forest in R: Step by Step Tutorial. ListenData. https://www.listendata.com/2014/11/random-forest-with-r.html

Grisanti, J. (2015). Decision Trees: An Overview. Aunalytics. https://www.aunalytics.com/2015/01/30/decision-trees-an-overview/

Hahs-Vaughn, D. (2004). The Impact of Parents' Education Level on College Students: An Analysis Using the Beginning Postsecondary Students Longitudinal Study 1990-92/94. https://muse.jhu.edu/article/173980/summary

Hair, J. F. (2006). Multivariate Data Analysis. Pearson Education India.

McKay, D. R. (2019). Should I Go to College? The Balance Careers. https://www.thebalancecareers.com/should-you-go-to-college-525564

Walpole, M. (2003). Socioeconomic Status and College: How SES Affects College Experiences and Outcomes. https://muse.jhu.edu/article/46608/summary