Fri 22 Sep

My focus was on understanding the factors affecting obesity and inactivity rates, and I took several key steps in this process.

I began by collecting comprehensive data on obesity, including associated factors such as food, economic conditions, physical environment, and exercise. I implemented code to generate insightful histograms, calculated essential statistical measures such as mean, median, mode, and variance, and created bar graphs that link counties/states with obesity rates and their corresponding risk factors. This data visualisation helps us gain a clearer understanding of the factors influencing obesity across different regions.
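Below is a minimal sketch of that exploratory step, assuming a hypothetical CSV file ("obesity_2018.csv") with placeholder columns "State" and "Obesity_Rate"; the actual dataset schema may differ.

```python
# A minimal sketch of the exploratory step; the file name and column
# names are hypothetical placeholders, not the real dataset schema.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("obesity_2018.csv")  # hypothetical file name

# Essential statistical measures for the obesity rate
rates = df["Obesity_Rate"]
print("mean:    ", rates.mean())
print("median:  ", rates.median())
print("mode:    ", rates.mode().iloc[0])
print("variance:", rates.var())

# Histogram of obesity rates across counties
rates.plot(kind="hist", bins=30, title="Distribution of obesity rates")
plt.xlabel("Obesity rate (%)")
plt.show()

# Bar graph linking states with their average obesity rate
df.groupby("State")["Obesity_Rate"].mean().sort_values().plot(
    kind="bar", figsize=(12, 4), title="Average obesity rate by state"
)
plt.ylabel("Obesity rate (%)")
plt.show()
```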

Additionally, I expanded the analysis to include time series data spanning the years 2006, 2010, 2014, and 2018. Utilising this data, I created informative time series graphs that reveal trends and patterns in obesity and inactivity rates over time. This temporal analysis will be invaluable in identifying long-term changes and helping us draw informed conclusions about these health-related issues.

Lastly, I employed line graphs to further enhance our understanding of the data, providing a visual representation of the trends and correlations within the obesity and inactivity datasets.
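The time-series and line graphs from the last two steps could look roughly like the sketch below, assuming a hypothetical long-format file with "Year", "Obesity_Rate", and "Inactivity_Rate" columns.

```python
# A sketch of the time-series line plot; the file and columns are
# hypothetical stand-ins for the real data.
import pandas as pd
import matplotlib.pyplot as plt

ts = pd.read_csv("obesity_inactivity_timeseries.csv")  # hypothetical file

years = [2006, 2010, 2014, 2018]
subset = ts[ts["Year"].isin(years)]

# National average per year for each measure
trend = subset.groupby("Year")[["Obesity_Rate", "Inactivity_Rate"]].mean()
trend.plot(marker="o", title="Obesity and inactivity rates over time")
plt.xlabel("Year")
plt.ylabel("Rate (%)")
plt.show()
```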


Wed 20 Sep

Continuation of Monday’s update…

After noticing overfitting in the model, I performed k-fold cross-validation on the data using scikit-learn to evaluate it. I divided the data into k = 5 folds, then trained and evaluated the model five times, using a different fold as the validation set each time. To estimate the generalisation performance of the model, the performance measures from each fold are averaged. Cross-validation does help avoid overfitting to some extent.
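A minimal sketch of this step with scikit-learn, using placeholder arrays in place of the real predictors (inactivity, obesity) and target (diabetes rate):

```python
# A minimal sketch of 5-fold cross-validation with scikit-learn;
# X and y are random placeholders, not the actual CDC data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X = np.random.rand(100, 2)  # placeholder predictors (inactivity, obesity)
y = np.random.rand(100)     # placeholder target (diabetes rate)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")

print("R² per fold:", scores)
print("mean R²:    ", scores.mean())  # averaged generalisation estimate
```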

Today’s lecture covered the t-test and the Monte Carlo test.

A t-test compares the mean values of two datasets to ascertain whether they could have come from the same population. I therefore performed a t-test on the diabetes values from the 2018 and 2016 datasets to compare the means of the two. The results showed some fluctuation between the years.
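A sketch of the comparison with SciPy, using placeholder arrays rather than the actual county-level values:

```python
# A sketch of the two-sample t-test; the arrays are placeholder
# values, not the real 2016/2018 diabetes rates.
import numpy as np
from scipy import stats

diabetes_2016 = np.array([8.1, 9.4, 10.2, 7.8, 11.0])  # placeholder values
diabetes_2018 = np.array([8.9, 9.8, 10.7, 8.3, 11.5])  # placeholder values

# Welch's t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(diabetes_2016, diabetes_2018, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```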

The Monte Carlo test is a technique for predicting the possible outcomes of an uncertain event. Rather than beginning with an average, the Monte Carlo approach evaluates a large number of random samples and then averages them.
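A toy illustration of the idea, simulating many random outcomes and averaging them only at the end:

```python
# A toy Monte Carlo illustration: simulate many random outcomes,
# then average them, instead of starting from an average.
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000

# Example: estimate the expected value of max(X, Y) for two
# independent uniform random variables on [0, 1]
x = rng.uniform(0, 1, n_trials)
y = rng.uniform(0, 1, n_trials)
estimate = np.maximum(x, y).mean()

print(f"Monte Carlo estimate: {estimate:.4f} (exact value is 2/3)")
```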

Mon 18 Sep

After today’s discussion on multiple linear regression, I ran the model on the 2018 dataset to predict diabetes using inactivity and obesity.

I used the formula/concept Y = A + B1*X1 + B2*X2, and to improve the model’s predictions I further added higher-order terms such as B3*X1^2 + B4*X1^3 + B5*X2^3 + …… With these terms the model could explain up to 56% of the variance.
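A sketch of this polynomial extension with scikit-learn; the arrays are placeholders, and degree-3 features stand in for the higher-order terms described above:

```python
# A sketch of the polynomial extension; X and y are random
# placeholders for the 2018 inactivity/obesity/diabetes data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(100, 2)  # placeholder predictors (inactivity, obesity)
y = np.random.rand(100)     # placeholder target (diabetes rate)

# Degree-3 features generate X1^2, X1^3, X2^3, cross terms, etc.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)
print("training R²:", model.score(X, y))
```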

Later, I tested the already-trained model (the one fitted on 2018 data) on a portion of the 2016 dataset and found that it could explain only about 43% of the variance.

This indicates that the model is overfitted. Overfitting is the undesired behavior in which a machine learning model predicts outcomes accurately for the training data but not for new data.

Overfitting can be mitigated by training the model with more data or by using cross-validation, which I’ll discuss further with the professor.

Fri 15 Sep

P-values help us determine whether our analysis or theory is valid by comparing the observed data to what we would expect if the null hypothesis were true. A very small p-value indicates that the data we observed would be highly unlikely if the null hypothesis were true, suggesting that the null hypothesis is likely false and that our analysis or theory may be correct.

By convention, a p-value of less than 0.05 is considered statistically significant, and the null hypothesis is rejected. A p-value greater than 0.05 indicates that the deviation from the null hypothesis is not statistically significant, so the null hypothesis is not rejected.

A p-value of 0.001 means that if the null hypothesis were true, there would be a one-in-1,000 probability of seeing outcomes at least as extreme as those observed. The observer therefore rejects the null hypothesis, since either an extremely unusual result occurred or the null hypothesis is false.

Finally, the p-value is used to assess the significance of observational data. There is always a chance that a correlation we identify between two variables is mere coincidence; a p-value estimate helps determine whether the observed relationship is due to chance.
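As an illustration, SciPy’s linregress reports the p-value for the slope of a simple linear fit; the arrays below are placeholders, not the actual CDC values:

```python
# A sketch of using a p-value to judge whether an observed
# correlation could be coincidence; the arrays are placeholders.
import numpy as np
from scipy import stats

inactivity = np.array([20.1, 25.3, 22.8, 30.4, 27.9, 24.2])  # placeholder
diabetes = np.array([7.2, 9.1, 8.0, 11.3, 10.2, 8.8])        # placeholder

slope, intercept, r_value, p_value, std_err = stats.linregress(inactivity, diabetes)
print(f"slope = {slope:.3f}, r = {r_value:.3f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Reject the null hypothesis of no linear relationship.")
else:
    print("Fail to reject the null hypothesis.")
```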

Wed 13 Sep


Although I ran a simple linear regression on two variables (inactivity, diabetes), there was a third (obesity). As a result, I read Chapter 3, Section 2 of the text on multiple linear regression and ran the model with all three variables. I used the model to predict each factor (mainly diabetes) by splitting the data into train and test sets 70:30 and checking the R² score to confirm how well the model works. The results were not satisfactory, so I split the data 50:50, and the R² score came close to 1. Overall, the model’s performance still has to be much improved.
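A minimal sketch of the fit-and-evaluate step, with placeholder arrays standing in for the three variables:

```python
# A sketch of the train/test split and R² check; X and y are
# random placeholders, not the real 2018 data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 2)  # placeholder predictors (inactivity, obesity)
y = np.random.rand(100)     # placeholder target (diabetes rate)

# 70:30 train/test split (test_size=0.5 reproduces the 50:50 variant)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
print("test R²:", r2_score(y_test, model.predict(X_test)))
```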

I obtained the 2016 dataset from the CDC website in order to compare the summary statistics of the factors with the 2018 dataset. According to my observations, the risk factors (inactivity and obesity) for diabetes rose with fluctuations, as did the central tendencies, with minor variations. To find out whether there is a statistically significant association between the independent variable and the dependent variable for which there was little variation, I performed hypothesis testing on a simple linear regression model using the datasets from both years. Later, I plotted a graph of the residuals, which showed that heteroscedasticity was present. I applied the White test and the Breusch-Pagan test, but there are a few questions I need to clarify with the instructor before I fully understand the issue.
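For reference, both heteroscedasticity tests are available in statsmodels; the sketch below uses placeholder data rather than the actual datasets:

```python
# A sketch of the heteroscedasticity checks with statsmodels;
# X and y are random placeholders, not the real CDC data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white

X = np.random.rand(100, 2)  # placeholder predictors
y = np.random.rand(100)     # placeholder target

exog = sm.add_constant(X)
ols = sm.OLS(y, exog).fit()

# Both tests take the residuals plus the regressors; a small
# p-value suggests that heteroscedasticity is present.
bp_lm, bp_p, _, _ = het_breuschpagan(ols.resid, exog)
w_lm, w_p, _, _ = het_white(ols.resid, exog)
print(f"Breusch-Pagan p = {bp_p:.4f}, White p = {w_p:.4f}")
```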