MTH522 PROJECT 2 REPORT
Gender disparity
In this analysis, we focused on the armed status of individuals who were shot, utilizing a dataset represented by the variable ‘data.’ To ensure the accuracy of our analysis, we removed any rows where information on the armed status was not available (NaN values).
Subsequently, we tallied the occurrences of each armed status category using the ‘value_counts()’ function, producing a count for each distinct armed status in the dataset.
The results were visualized through a bar chart, generated using the ‘matplotlib’ library, with the x-axis representing different armed statuses and the y-axis indicating the frequency of occurrences. The chart was customized with a title, labels for both axes, and proper rotation for better readability of the x-axis labels.
The chosen color scheme for the bars was derived from the ‘tab10’ colormap. The final plot provides a clear overview of the distribution of armed statuses among individuals who were shot, offering insights into the prevalence of each category.
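The counting-and-plotting pattern described above can be sketched as follows. This is a minimal illustration on a hypothetical stand-in for the dataset; the column name ‘armed’ and the sample values are assumptions, and the same pattern applies to the fleeing-status and mental-illness analyses below by swapping the column name.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove if displaying interactively
import matplotlib.pyplot as plt

# Hypothetical stand-in for the real dataset; column name and values are assumptions
data = pd.DataFrame({"armed": ["gun", "knife", None, "gun", "unarmed", "gun", "knife"]})

# Drop rows with missing armed status, then count each category
armed_counts = data["armed"].dropna().value_counts()

# Bar chart with colors drawn from the 'tab10' colormap
colors = plt.cm.tab10(range(len(armed_counts)))
fig, ax = plt.subplots()
ax.bar(armed_counts.index, armed_counts.values, color=colors)
ax.set_title("Armed Status of Individuals Shot")
ax.set_xlabel("Armed Status")
ax.set_ylabel("Count")
plt.xticks(rotation=45)
plt.tight_layout()
```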
In this analysis, we examined whether individuals were fleeing from the police, using a dataset labeled ‘data.’ To ensure the accuracy of our examination, we first removed any rows where information about the fleeing status was not available (NaN values).
Then, we proceeded to count the occurrences of each fleeing status category through the ‘value_counts()’ function, generating a count for each unique fleeing status in the dataset.
The findings were visually presented using a bar chart, crafted with the ‘matplotlib’ library. The x-axis of the chart illustrates different fleeing statuses, while the y-axis indicates the frequency of occurrences. To enhance readability, the chart includes a title, labels for both axes, and proper rotation of the x-axis labels.
The color palette chosen for the bars was derived from the ‘tab10’ colormap. The resulting plot offers a clear representation of the distribution of fleeing statuses among individuals involved in police shootings, shedding light on the prevalence of each category.
In this analysis, we delved into whether individuals exhibited signs of mental illness, using a dataset denoted as ‘data.’ To ensure the reliability of our examination, we initially removed any rows where information regarding signs of mental illness was not available (NaN values).
Subsequently, we tabulated the occurrences of each mental illness status category through the ‘value_counts()’ function, producing a count for each distinct status in the dataset.
The results were visually communicated via a bar chart, crafted with the ‘matplotlib’ library. The x-axis of the chart illustrates different mental illness statuses, while the y-axis denotes the frequency of occurrences. For clarity, the chart features a title, labels for both axes, and appropriate rotation of the x-axis labels.
The color scheme chosen for the bars was derived from the ‘tab10’ colormap. The resultant plot provides a visual insight into the distribution of mental illness statuses among individuals involved in the context under consideration, offering a glimpse into the prevalence of each category.
The bar plot indicates that signs of mental illness in the individual shot may contribute to the occurrence of fatal police shootings in some cases. However, individuals exhibiting no signs of mental illness were more likely to be shot by police than those with such signs.
Fatal police shootings predominantly involved men aged between 25 and 35 who were armed with either a gun or a knife and did not exhibit signs of mental illness. The majority of these individuals were not fleeing from the police, and the prevalent racial demographic was white.
Geographically, the incidents were concentrated primarily in California, with Los Angeles emerging as the top city in this context. Other states, such as Texas and Florida, also experienced these incidents with relatively consistent but varying frequency. Notably, California exhibited the highest occurrence, while Texas, Florida, and several other states followed at lower, fluctuating counts.
On a city level, Phoenix secured the second position, and at the state level, Arizona, where Phoenix is situated, likewise held the second position. Houston ranked third among cities, and Texas, where Houston is located, ranked third among states. This analysis sheds light on the patterns and distribution of fatal police shootings, emphasizing demographic and geographic factors.
Washington data analysis
This report undertakes a comprehensive analysis of demographic factors, including age distribution, race, mental health conditions, gender, and other pertinent variables, within the dataset comprising individuals involved in police shootings. The data utilized for this analysis is sourced from the Washington Post Police Shootings Database. The primary aim of this report is to illuminate the age demographics of individuals impacted by police violence, offering a meticulous and insightful analysis of the findings.
The foundation of this analysis rests upon the Washington Post Police Shootings Database, encompassing data pertaining to incidents of police shootings in the United States spanning the years 2015 to 2023. To ensure the integrity of our analysis, any absent age values were substituted with the dataset’s median age, and meticulous measures were implemented to address NaN and null values across all other columns. These preprocessing steps were crucial in preparing the dataset for subsequent visualization and analysis.
We use Python code that utilizes the pandas and matplotlib libraries for dataset handling and graphical representation, respectively. The Washington Post Police Shootings Database, in CSV format, is loaded into a pandas DataFrame named ‘data.’ Subsequently, the analysis focuses on examining trends over time, particularly the number of fatal police shootings each year.
To ensure data integrity, rows with NaN values in the ‘date’ column are removed. The ‘date’ column is then converted to a datetime format, and the corresponding years are extracted and stored in a new column named ‘year.’ The number of fatal police shootings per year is computed using the groupby function, and a line plot is generated using matplotlib.
– Loading the CSV file into a pandas DataFrame
– Dropping rows with NaN values in the ‘date’ column for data integrity
– Converting ‘date’ to datetime and extracting the year
– Calculating the number of fatal police shootings per year
– Creating a line plot
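The steps above can be sketched as follows. The DataFrame here is a small hypothetical sample standing in for the Washington Post CSV; the ‘date’ column name matches the description, but the values are invented for illustration.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical sample; in practice, data = pd.read_csv(...) on the real CSV
data = pd.DataFrame({"date": ["2015-03-01", "2015-07-09", "2016-01-15",
                              None, "2016-11-30", "2017-05-21"]})

data = data.dropna(subset=["date"])          # drop rows with missing dates
data["date"] = pd.to_datetime(data["date"])  # convert to datetime
data["year"] = data["date"].dt.year          # extract the year

# Number of fatal police shootings per year
shootings_per_year = data.groupby("year").size()

fig, ax = plt.subplots()
shootings_per_year.plot(kind="line", marker="o", ax=ax)
ax.set_title("Fatal Police Shootings per Year")
ax.set_xlabel("Year")
ax.set_ylabel("Number of Shootings")
plt.tight_layout()
```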
Findings
The examination of the plotted graph reveals a notable trend in the number of fatal police shootings over the years. Between 2016 and 2022, there was a pronounced and consistent increase in the count of such incidents. However, a conspicuous anomaly is observed in the year 2023, where a substantial decline in the count is evident. This abrupt shift prompts a cautious interpretation, raising the possibility of missing data or inconclusive details regarding the circumstances leading to the police shootings. It is essential to consider the potential factors contributing to this unexpected decrease and exercise prudence in drawing definitive conclusions about the incidents during the year 2023.
The following Python code examines the relationship between race and the number of individuals shot, employing the pandas and matplotlib libraries for data manipulation and visualization. Rows with NaN values in the ‘race’ column are removed to ensure data integrity. Subsequently, the count of people shot for each race is calculated, and a bar chart is generated to visually represent the distribution.
– Dropping rows with NaN values in the ‘race’ column for data integrity
– Counting the number of people shot for each race
– Creating a bar chart
– Rotating x-axis labels for readability
– Adjusting the layout to prevent overlapping
The provided Python code addresses the handling of missing or null values in the ‘age’ column by filling them with the median age. Subsequently, the code calculates the occurrences of each unique age, extracts the unique age values, and generates a scatter plot depicting the distribution of ages of individuals shot by the police.
– Filling NaN values or null values in the ‘age’ column with the median
– Counting the occurrences of each unique age
– Extracting the unique age values
– Creating a scatter plot using the unique age values and their counts
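A minimal sketch of this age-handling step, using a hypothetical ‘age’ column with invented values:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical sample ages; real values come from the dataset
data = pd.DataFrame({"age": [25.0, 30.0, None, 25.0, 40.0, None]})

# Fill missing ages with the median age
data["age"] = data["age"].fillna(data["age"].median())

# Count the occurrences of each unique age
age_counts = data["age"].value_counts().sort_index()

# Scatter plot of unique ages against their counts
fig, ax = plt.subplots()
ax.scatter(age_counts.index, age_counts.values)
ax.set_title("Age Distribution of Individuals Shot")
ax.set_xlabel("Age")
ax.set_ylabel("Count")
plt.tight_layout()
```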
The following Python code conducts an analysis on the gender distribution of individuals involved in fatal police shootings. It utilizes the pandas library to handle the dataset and matplotlib for graphical representation. Rows with NaN values in the ‘gender’ column are dropped to ensure data integrity. The code then calculates the number of people shot for each gender and generates a bar chart to illustrate the gender distribution.
– Dropping rows with NaN values in the ‘gender’ column for data integrity
– Counting the number of people shot for each gender
– Creating a bar chart
– Rotating x-axis labels for readability
– Adjusting the layout to prevent overlapping
The following Python code examines fatal police shootings based on geographical locations, specifically focusing on cities and states. Utilizing the pandas library for data manipulation and matplotlib for visualization, the code groups the data by city and state, counts the number of shootings for each, and extracts the top 10 cities and states with the highest number of fatal police shootings. Bar charts are then generated to illustrate these findings.
– Grouping the data by city and counting the number of shootings for each city
– Extracting the top 10 cities with the highest number of shootings
– Creating a bar chart for the top 10 cities
– Grouping the data by state and counting the number of shootings for each state
– Extracting the top 10 states with the most shootings
– Creating a bar chart for the top 10 states
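The grouping-and-ranking steps can be sketched as below. The ‘city’ and ‘state’ column names follow the description; the rows are a small invented sample, not the real data.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical sample rows standing in for the real dataset
data = pd.DataFrame({
    "city":  ["Los Angeles", "Phoenix", "Houston", "Los Angeles", "Phoenix", "Los Angeles"],
    "state": ["CA", "AZ", "TX", "CA", "AZ", "CA"],
})

# Group by city/state, count shootings, and keep the 10 largest of each
top_cities = data.groupby("city").size().nlargest(10)
top_states = data.groupby("state").size().nlargest(10)

# Side-by-side bar charts for the top cities and states
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
top_cities.plot(kind="bar", ax=ax1, title="Top Cities")
top_states.plot(kind="bar", ax=ax2, title="Top States")
plt.tight_layout()
```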
MTH522 PROJECT 1 REPORT
Cross Validation
Cross Validation and Bootstrapping are resampling methods primarily employed for assessing test-set predictions. Cross Validation is utilized to estimate the test-set prediction error, while Bootstrapping helps determine standard deviation and bias parameters. Together, the combination of bias and variance from Bootstrapping contributes to understanding the overall prediction error.
In addition to Cross Validation and Bootstrapping, other techniques such as validation sets are often employed to achieve the best possible model performance. In the validation set approach, the dataset is randomly divided into two halves: a training set and a validation set. The training data is used to train the model, and then its predictive performance is evaluated on the validation set. This assessment is typically based on metrics like mean square error, providing an estimation of the model’s test error.
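The validation set approach can be sketched as follows on synthetic data. The linear relationship and all numbers here are invented for illustration; only the split-train-evaluate structure reflects the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x plus noise (illustrative only)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 100)

# Randomly divide the dataset into two halves: training and validation
idx = rng.permutation(100)
train, valid = idx[:50], idx[50:]

# Fit a simple linear model on the training half (least squares)
X_train = np.column_stack([np.ones(50), x[train]])
beta, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)

# Estimate test error as the mean squared error on the validation half
X_valid = np.column_stack([np.ones(50), x[valid]])
mse = np.mean((y[valid] - X_valid @ beta) ** 2)
```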
K-fold cross-validation represents a specific variant of Cross Validation. In K-fold cross-validation, the dataset is partitioned into “k” parts or folds, allowing the model to be trained and tested “k” times, with each fold serving as a test set once. This technique offers a robust means to assess model performance by mitigating issues such as overfitting and providing a more comprehensive evaluation of how well the model generalizes to unseen data.
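A minimal K-fold sketch, again on invented synthetic data: the dataset is partitioned into k folds, each fold serves as the test set once, and the k fold-level errors are averaged.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (illustrative only)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 100)

k = 5
indices = rng.permutation(100)
folds = np.array_split(indices, k)  # partition into k folds

fold_mses = []
for i in range(k):
    test_idx = folds[i]                                            # fold i is the test set
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])  # remaining folds train
    X_train = np.column_stack([np.ones(len(train_idx)), x[train_idx]])
    beta, *_ = np.linalg.lstsq(X_train, y[train_idx], rcond=None)
    X_test = np.column_stack([np.ones(len(test_idx)), x[test_idx]])
    fold_mses.append(np.mean((y[test_idx] - X_test @ beta) ** 2))

# Cross-validation estimate of the test error
cv_mse = np.mean(fold_mses)
```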
Pre-molt and Post-molt
We are analyzing whether there is a significant difference between pre-molt and post-molt crab sizes as part of our research. The data show a huge variation in the values of kurtosis, with post-molt kurtosis rising to an astonishing 13.116 and pre-molt kurtosis being relatively low at 9.76632. The two groups have surprisingly similar shapes when we compare the actual sizes of the crabs; the main difference between the two groups’ means is 14.6858.
Determining whether the observed difference in mean size is a real phenomenon or just a statistical aberration is our main challenge. Our first inclination is to use the tried-and-true t-test to answer this question. The t-test, however, is predicated on the assumption that the data follow a normal distribution, which is dubious in our case given the high values of kurtosis. In light of this, we suggest the Monte Carlo permutation test as an alternative strategy that can gracefully accommodate the non-normality of our data. To determine whether the sizes of pre-molt and post-molt crabs differ significantly, we combine the two datasets, randomly split the combined data into two groups of equal size ten million times, and compute the mean difference for each split. The distribution of these mean differences tells us how likely it is, under the null hypothesis, to observe a mean difference as extreme as the one in our actual data. This will be shown as a curve of the permuted mean differences (p) in relation to the total number of permutations (N).
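The permutation procedure described above can be sketched as follows. The pre-molt and post-molt samples here are simulated (with a mean shift near the reported 14.7), not the real crab data, and the permutation count is reduced from ten million for speed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated stand-ins for the crab data; means and spreads are assumptions
pre_molt = rng.normal(130.0, 10.0, 200)
post_molt = rng.normal(144.7, 10.0, 200)  # mean shifted ~14.7, as in the report

observed_diff = abs(post_molt.mean() - pre_molt.mean())

# Combine the two datasets, then repeatedly split them at random
combined = np.concatenate([pre_molt, post_molt])
n = len(pre_molt)
n_perm = 10_000  # the report uses ten million; reduced here for speed

count_extreme = 0
for _ in range(n_perm):
    rng.shuffle(combined)  # random split into two equal-size groups
    diff = abs(combined[n:].mean() - combined[:n].mean())
    if diff >= observed_diff:  # permuted difference at least as extreme as observed
        count_extreme += 1

# Fraction of permutations as extreme as the observed difference (p-value)
p_value = count_extreme / n_perm
```

A tiny p-value indicates that a mean difference this large is essentially never produced by random relabeling, so the pre-molt/post-molt difference is unlikely to be a statistical aberration.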
Linear Regression and Multilinear Regression
The linear regression model suggests that we can describe one variable (the dependent variable) based on the values of another variable (the independent variable). In our dataset, we have identified three key parameters: diabetes, inactivity, and obesity. Consequently, the variables of interest are represented as percentages of diabetes, inactivity, and obesity. As covered during the class, when determining the percentage of diabetes, we employed the percentage of inactivity as the independent variable, resulting in the equation % diabetes = α + β % inactivity + ε. Similarly, we can extend this approach to multiple linear regression by incorporating two independent variables: the percentage of inactivity and the percentage of obesity. The equation for this extended model would be % diabetes = α + β1 % inactivity + β2 % obesity + ε.
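The multiple linear regression above can be fit by least squares, sketched here on synthetic county-level percentages (the coefficients and data are invented; they are not the real diabetes dataset).

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic percentages (illustrative only; not the real data)
inactivity = rng.uniform(10, 30, 300)
obesity = rng.uniform(20, 40, 300)
diabetes = 1.5 + 0.20 * inactivity + 0.10 * obesity + rng.normal(0, 0.5, 300)

# % diabetes = alpha + beta1 * % inactivity + beta2 * % obesity + error
X = np.column_stack([np.ones(300), inactivity, obesity])
coef, *_ = np.linalg.lstsq(X, diabetes, rcond=None)
alpha, beta1, beta2 = coef
```

Dropping the obesity column from `X` recovers the simple regression % diabetes = α + β % inactivity + ε.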
If the kurtosis is positive, the data exhibit more outliers than a normal distribution; if it is negative, they exhibit fewer. Heteroscedasticity occurs when the variance of the data varies widely, often accompanied by a greater number of outliers.