Cross Validation

Cross Validation and Bootstrapping are resampling methods used to assess how well a model will predict on unseen data. Cross Validation is used to estimate the test-set prediction error, while Bootstrapping is used to estimate the standard error and bias of an estimator. Together, the bias and variance obtained from Bootstrapping help characterize the overall prediction error.
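
As a rough illustration of how Bootstrapping estimates standard error and bias, here is a minimal sketch on a made-up sample, using the median as the statistic of interest; both the data and the choice of statistic are illustrative assumptions, not part of our analysis.

```python
# Minimal bootstrap sketch: resample with replacement, recompute the statistic,
# and summarize the resulting distribution (illustrative data only).
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=100)   # hypothetical sample

n_boot = 5000
boot_medians = np.empty(n_boot)
for b in range(n_boot):
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_medians[b] = np.median(resample)

std_error = boot_medians.std(ddof=1)                 # bootstrap standard error
bias = boot_medians.mean() - np.median(sample)       # bootstrap bias estimate
print(f"standard error = {std_error:.3f}, bias = {bias:.3f}")
```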

In addition to Cross Validation and Bootstrapping, a simpler technique, the validation set approach, is often used to estimate how well a model performs. In the validation set approach, the dataset is randomly divided into two halves: a training set and a validation set. The model is fit on the training set, and its predictive performance is then evaluated on the validation set, typically with a metric such as the mean squared error, which provides an estimate of the model's test error.
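
To make the procedure concrete, here is a minimal sketch of the validation set approach in Python, using synthetic data and scikit-learn; the data, the 50/50 split, and the linear model are all illustrative assumptions rather than our actual analysis.

```python
# Validation set approach: split the data in half, fit on one half,
# and estimate the test error with the mean squared error on the other half.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))                 # hypothetical predictor
y = 3.0 * X.ravel() + rng.normal(scale=2.0, size=200)

# Randomly divide the data into two halves: training and validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5,
                                                  random_state=0)

model = LinearRegression().fit(X_train, y_train)
val_mse = mean_squared_error(y_val, model.predict(X_val))
print(f"validation MSE = {val_mse:.3f}")              # estimate of the test error
```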

K-fold cross-validation is a specific variant of Cross Validation. The dataset is partitioned into “k” parts, or folds, and the model is trained and tested “k” times, with each fold serving as the test set once while the remaining k − 1 folds are used for training. This offers a robust means of assessing model performance: it mitigates issues such as overfitting and provides a more comprehensive picture of how well the model generalizes to unseen data.
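
A sketch of K-fold cross-validation along the same lines, assuming k = 5, a linear model, and the same kind of synthetic data as above; these choices are illustrative, and scikit-learn's KFold and cross_val_score handle the fold bookkeeping.

```python
# 5-fold cross-validation: each fold serves as the test set once,
# and the model is refit on the remaining folds each time.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(200, 1))                 # hypothetical predictor
y = 3.0 * X.ravel() + rng.normal(scale=2.0, size=200)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=kfold)
print(f"cross-validated MSE = {-scores.mean():.3f}")
```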

Pre-molt and Post-molt

As part of our research, we are analyzing whether there is a significant difference between pre-molt and post-molt crab sizes. The two groups differ substantially in kurtosis: the post-molt kurtosis rises to an astonishing 13.116, while the pre-molt kurtosis is comparatively lower at 9.76632. When we compare the actual size distributions, however, the two groups have surprisingly similar shapes; the main difference lies in their means, which differ by 14.6858.
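
For reference, a minimal sketch of how these descriptive statistics could be computed, assuming the crab measurements live in a CSV with hypothetical columns "presz" and "postsz"; the file name and column names are assumptions, not the actual dataset.

```python
# Descriptive comparison of pre-molt and post-molt sizes:
# kurtosis of each group and the difference in means.
import pandas as pd
from scipy.stats import kurtosis

crabs = pd.read_csv("crab_molt.csv")                  # hypothetical file name
pre = crabs["presz"].dropna()
post = crabs["postsz"].dropna()

print("pre-molt kurtosis: ", kurtosis(pre))
print("post-molt kurtosis:", kurtosis(post))
print("difference in means:", post.mean() - pre.mean())
```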

Determining whether the observed difference in mean size is a real phenomenon or just a statistical aberration is our main challenge. Our first inclination is to use the tried-and-true T-test to answer this question. The T-test, however, is predicated on the assumption that the data follow a normal distribution, which is dubious in our case given the high kurtosis values. In light of this, we suggest the Monte Carlo permutation test as an alternative strategy that can gracefully accommodate the non-normality of our data. To determine whether the sizes of pre-molt and post-molt crabs differ significantly, we combine the two datasets, then, ten million times, randomly split the combined data into two groups of equal size and compute the difference in means for each split. The distribution of these permuted mean differences tells us how likely it is, under the null hypothesis, to observe a mean difference as extreme as the one in our actual data. This will be shown as a curve of the permuted mean differences (p) in relation to the total number of permutations (N).
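
Below is a minimal sketch of this permutation test, assuming the pre-molt and post-molt sizes are available as NumPy arrays; the arrays used here are simulated stand-ins, and only 10,000 permutations are run rather than ten million, purely to keep the example fast.

```python
# Monte Carlo permutation test for the difference in means between two groups.
import numpy as np

def permutation_test(pre, post, n_perm=10_000, seed=0):
    rng = np.random.default_rng(seed)
    observed = post.mean() - pre.mean()
    pooled = np.concatenate([pre, post])
    n = len(post)

    perm_diffs = np.empty(n_perm)
    for i in range(n_perm):
        rng.shuffle(pooled)                           # random relabelling
        perm_diffs[i] = pooled[:n].mean() - pooled[n:].mean()

    # Fraction of permuted differences at least as extreme as the observed one.
    p_value = np.mean(np.abs(perm_diffs) >= abs(observed))
    return observed, perm_diffs, p_value

# Hypothetical arrays standing in for the pre-molt and post-molt sizes.
rng = np.random.default_rng(1)
pre = rng.normal(130.0, 10.0, size=200)
post = rng.normal(144.0, 10.0, size=200)
obs, diffs, p = permutation_test(pre, post)
print(f"observed mean difference = {obs:.3f}, permutation p-value = {p:.4f}")
```

A histogram or curve of the permuted differences against the number of permutations can then be drawn from the returned array to visualize how extreme the observed difference is.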

Linear Regression and Multilinear Regression

The linear regression model suggests that we can describe one variable (the dependent variable) in terms of another variable (the independent variable). In our dataset, we have identified three key parameters: diabetes, inactivity, and obesity, each represented as a percentage. As covered during class, to model the percentage of diabetes we used the percentage of inactivity as the independent variable, giving the equation % diabetes = α + β % inactivity + ε. Similarly, we can extend this approach to multiple linear regression by incorporating two independent variables, the percentage of inactivity and the percentage of obesity, which yields % diabetes = α + β1 % inactivity + β2 % obesity + ε.
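
A short sketch of both regressions using statsmodels formulas, assuming the county-level data sit in a CSV with hypothetical column names "diabetes", "inactivity", and "obesity" (all in percent); the file and column names are assumptions for illustration.

```python
# Simple and multiple linear regression for % diabetes.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("cdc_county_data.csv")             # hypothetical file name

# % diabetes = alpha + beta * % inactivity + error
simple = smf.ols("diabetes ~ inactivity", data=data).fit()

# % diabetes = alpha + beta1 * % inactivity + beta2 * % obesity + error
multiple = smf.ols("diabetes ~ inactivity + obesity", data=data).fit()

print(simple.params)
print(multiple.params)
```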

If the kurtosis is positive, the data exhibit more outliers than the normal distribution, whereas if it is negative, the data exhibit fewer outliers than the normal distribution. Heteroscedasticity occurs when the variance of the data varies widely across observations, which is often accompanied by a larger number of outliers.
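
A small simulated illustration of the kurtosis statement: a heavy-tailed t-distribution shows positive excess kurtosis (more outliers) compared with a normal distribution, which sits near zero. The distributions chosen here are purely illustrative.

```python
# Excess kurtosis of a normal sample versus a heavy-tailed t-distributed sample.
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(3)
normal_data = rng.normal(size=100_000)
heavy_tailed = rng.standard_t(df=5, size=100_000)     # more outliers than normal

print("normal excess kurtosis:      ", round(kurtosis(normal_data), 3))   # near 0
print("heavy-tailed excess kurtosis:", round(kurtosis(heavy_tailed), 3))  # positive
```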