Mon 30 Oct

Applied a decision tree method (Random Forest) to the data, using latitude and longitude as features and the target value (id) as the variable to predict.
Then calculated the distance between each pair of points.
Later split the data into training and test sets to evaluate the performance of the model on unseen data.
Assessed the spatial autocorrelation in the data to understand whether there are any spatial patterns that need to be considered in the modelling process.
Then adjusted the number of trees, the tree depth and the gradient boosting parameters, and trained the model on the training dataset.
The model learns to capture the spatial and non-spatial patterns in the data.
The accuracy of the model was 67%.
Applied the trained model to the test data and created spatial maps of the predicted outcomes.
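
A minimal sketch of this step, assuming the data live in a hypothetical CSV with latitude, longitude and a categorical target column (the file name, column names, 80/20 split and hyperparameter values are illustrative assumptions, not the exact code used):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical file and column names.
df = pd.read_csv("police_shootings.csv").dropna(subset=["latitude", "longitude", "target"])
X = df[["latitude", "longitude"]]
y = df["target"]

# Hold out unseen data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Illustrative hyperparameters: number of trees and tree depth.
model = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test set (the entry above reports roughly 67% accuracy).
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```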

Fri 27 Oct

Having performed both the k-means clustering and DBSCAN algorithms on the data provided, I noticed that the two operate on different principles and have distinct characteristics.

K-means is a partitioning technique that divides data points into K clusters by minimising the squared distance between each point and its cluster centre. DBSCAN is a density-based method that clusters nearby data points in high-density regions while flagging points in low-density areas as outliers.
K-means assumes spherical, roughly equal-sized clusters and requires the number of clusters to be specified before running the algorithm. This means that when dealing with clusters that have different densities, sizes or irregular shapes, it might not perform effectively. DBSCAN is more resilient when handling clusters with irregular shapes, since it can recognise clusters of any shape and automatically determines the number of clusters based on the data’s density structure. There is no set cluster shape that it must adhere to.

The DBSCAN algorithm efficiently recognises outliers and classifies them as noise. It works well with our dataset because it is robust to noise and does not force noise points into clusters, whereas K-means does not specifically address noise or outliers.
K-means is appropriate for large datasets with modest dimensionality and can be computationally efficient. It may, however, be influenced by the choice of initial cluster centres. DBSCAN can handle datasets with varying densities and is less sensitive to initialisation. It is less effective for high-dimensional data, though, as its complexity increases with dataset size and dimensionality.
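
To make the contrast concrete, here is a small side-by-side sketch on synthetic coordinates (the eps, min_samples and n_clusters values are illustrative, not tuned for our data):

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the latitude/longitude pairs: two dense clusters plus sparse noise.
rng = np.random.default_rng(0)
coords = np.vstack([
    rng.normal(loc=(40.7, -74.0), scale=0.05, size=(100, 2)),
    rng.normal(loc=(34.0, -118.2), scale=0.05, size=(100, 2)),
    rng.uniform(low=(25.0, -125.0), high=(49.0, -67.0), size=(20, 2)),
])
X = StandardScaler().fit_transform(coords)

# K-means needs the number of clusters up front and assigns every point to a cluster.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN infers the clusters from density and marks sparse points as -1 (noise).
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print("K-means clusters:", set(kmeans_labels))
print("DBSCAN clusters (-1 = noise):", set(dbscan_labels))
```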

Mon 23 Oct

With the given data, I performed k-means clustering, which is a popular machine learning algorithm used for unsupervised clustering tasks. It is a partitioning algorithm that divides the dataset into k clusters, where each data point belongs to the cluster with the nearest mean.
First, I determined the number of clusters using the elbow method, a graphical approach that chooses the value of k by analysing how the clustering performance varies as k increases.
Later, selected k data points as the initial centroids.
Then calculated the distance between each data point and each centroid and assigned each data point to the cluster with the nearest centroid.
The final result I got is k clusters with their respective centroids.
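
A minimal sketch of this procedure, assuming a hypothetical CSV file and the latitude/longitude columns as features (scikit-learn's KMeans handles the centroid initialisation and assignment steps internally; the final k = 4 is only an example value read off the elbow plot):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical file and column names.
df = pd.read_csv("police_shootings.csv")
X = df[["latitude", "longitude"]].dropna()

# Elbow method: plot the within-cluster sum of squares (inertia) against k.
inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("Within-cluster sum of squares")
plt.title("Elbow method")
plt.show()

# Fit the final model with the k chosen from the elbow plot (e.g. k = 4).
final_km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(final_km.cluster_centers_)
```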

Fri 20 Oct

In logistic regression, the coefficients represent the relationship between the independent variables and the log-odds of the dependent variable (binary outcome). The coefficients are estimated during the training of the logistic regression model. The logistic function (sigmoid function) is then applied to these log-odds to obtain the predicted probabilities.
In the context of geospatial data, logistic regression coefficients can be interpreted similarly to logistic regression in general, but with a spatial context. The logistic regression model will try to capture the relationship between the spatially distributed independent variables and the probability of an event occurring (binary outcome).
Since our data has both longitude and latitude, I used the formula
log-odds = B0 + B1 * LATITUDE + B2 * LONGITUDE + …
Here,
log-odds is the natural logarithm of the odds of the event occurring,
B0 is the intercept, and
B1 and B2 are the coefficients associated with latitude and longitude.

The logistic regression model would then predict the probability of an event occurring at different locations in the geospatial dataset.

In the code, I used the longitude and latitude variables to predict whether the event occurs. I then trained the logistic regression model and displayed the coefficients.
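
A hedged sketch of that code, assuming a hypothetical file name and a binary 0/1 column called event for the outcome (the real column names may differ):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; "event" is assumed to be a 0/1 outcome.
df = pd.read_csv("police_shootings.csv").dropna(subset=["latitude", "longitude", "event"])
X = df[["latitude", "longitude"]]
y = df["event"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression().fit(X_train, y_train)

# B0 (intercept) and B1, B2 (latitude, longitude coefficients) on the log-odds scale.
print("Intercept (B0):", model.intercept_[0])
print("Coefficients (B1, B2):", dict(zip(X.columns, model.coef_[0])))

# Predicted probability of the event at each test location.
probs = model.predict_proba(X_test)[:, 1]
print(probs[:5])
```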

Wed 18 Oct

Based on the data analysis, here is a summary of the key findings (the frequency tables behind these counts are sketched after the list):

  • Flee Variable:
    • There are four categories in the “flee” variable: not fleeing, car, foot, and other.
    • More than 4000 people did not flee after the attack.
    • Around 1300 people fled, some by car and some on foot.
  • Manner of Death:
    • More than 7000 people were shot.
    • The remainder were shot and tasered.
  • Gender Distribution:
    • Males were more involved in the crimes compared to females.
  • Race Distribution:
    • People from different races were involved, with whites and blacks being the most common.
  • Geographical Distribution:
    • Three states had a higher incidence of the reported crimes.
  • Signs of Mental Illness:
    • Around 2000 people had signs of mental illness.
  • Threat Level:
    • Approximately 4000 people were reported to have attacked.
  • Association Between Mental Illness and Fleeing:
    • Among those with mental illness, around 1300 chose not to flee, while fewer than 200 fled on foot or by car.
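
A short sketch of the frequency tables behind these counts, assuming the column names of the public Washington Post dataset (flee, manner_of_death, gender, race, state, signs_of_mental_illness, threat_level); the copy analysed here may use different names:

```python
import pandas as pd

# Hypothetical file name; column names follow the Washington Post schema.
df = pd.read_csv("police_shootings.csv")

for col in ["flee", "manner_of_death", "gender", "race",
            "state", "signs_of_mental_illness", "threat_level"]:
    print(df[col].value_counts(dropna=False), "\n")

# Cross-tabulation behind the mental-illness vs fleeing observation.
print(pd.crosstab(df["signs_of_mental_illness"], df["flee"]))
```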

Mon 16 Oct

The data contains latitude and longitude variables, known as geoposition data, from which insights can be extracted and analysed. From these coordinates, the locations where the shootings took place can be analysed.
Later on, geodesic distance can be used to find the distance between one pair of longitude/latitude coordinates and another. I used the haversine formula, which gives a reasonably accurate estimate of the shortest distance between two points.
Then created a geolist plot using matplotlib that includes all the longitude and latitude coordinates.
After analysing the visualisation, applied a nearest-neighbour-based clustering (KNN) to group points that are close to each other on the map. Also created a heatmap to identify the regions with low and high concentrations of data points.
Since the geolist plot had time stamps, analysed the distribution of points over time.
Another clustering algorithm, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), is used for grouping spatial data points based on their density. It is suited to discovering clusters with irregular shapes and handling outliers. A geo histogram is used for this specific data.
Next, I will find the outliers in the data and perform the DBSCAN algorithm.
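
A minimal sketch of the haversine calculation mentioned above, in pure Python, assuming a mean Earth radius of 6371 km:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
    """Approximate great-circle distance between two (lat, lon) points, in km."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

# Example: distance between two arbitrary coordinate pairs (roughly 3940 km).
print(haversine_km(40.7128, -74.0060, 34.0522, -118.2437))
```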

Fri 13 Oct

Performed missing value imputation using mean, median and mode imputation. While this strategy is simple and quick, it is not always the ideal option, particularly when the missing data mechanism is not completely random. Therefore used regression imputation to predict missing values based on other features in the dataset.
Also came to know how the ANOVA test can be used to find significant differences in the geodesic distances of multiple groups of data points.
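
A sketch of both approaches with scikit-learn, on a small illustrative frame (SimpleImputer for the mean/median/mode strategies, and IterativeImputer as one way to do regression-style imputation; the real notebook may have used a different implementation):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# IterativeImputer is still behind an experimental flag in scikit-learn.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Small illustrative frame with missing values (stand-in for the real data).
df = pd.DataFrame({
    "age": [25.0, np.nan, 37.0, 41.0, np.nan],
    "latitude": [40.7, 34.0, np.nan, 47.6, 33.4],
    "longitude": [-74.0, -118.2, -87.6, np.nan, -112.1],
})

# Simple single-column strategies: mean / median / most frequent (mode).
age_mean = SimpleImputer(strategy="mean").fit_transform(df[["age"]])
age_median = SimpleImputer(strategy="median").fit_transform(df[["age"]])
age_mode = SimpleImputer(strategy="most_frequent").fit_transform(df[["age"]])

# Regression-style imputation: each feature is modelled from the others.
imputed = IterativeImputer(random_state=0).fit_transform(df)
print(pd.DataFrame(imputed, columns=df.columns))
```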

Wed 11 Oct

It was discovered through analysis of the “Washington Post Police Shooting Report” that there are around 8768 data points with 12 different factors. Because there are null values, the mean and median that I obtained are likely improper. I used the .describe() method, and the resulting mean and standard deviation were 37.28 and 12.99, respectively. The 2693 null values were primarily in the following categories: name, age, gender, armed, city, and flee.

Today I also learned about the ANOVA test, which is used instead of the t-test to determine whether there are statistically significant differences between the means of three or more groups. I also discovered that geopy’s distance module (geopy.distance.geodesic) can be used to calculate geodesic distance.
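
A short sketch of both ideas, using geopy.distance.geodesic for the distance and scipy’s one-way ANOVA (the three groups of distances below are synthetic placeholders):

```python
import numpy as np
from geopy.distance import geodesic
from scipy.stats import f_oneway

# Geodesic (ellipsoidal) distance between two (lat, lon) pairs, in km.
point_a = (40.7128, -74.0060)
point_b = (34.0522, -118.2437)
print(geodesic(point_a, point_b).km)

# One-way ANOVA across three or more groups of distances (synthetic groups here).
rng = np.random.default_rng(0)
group1 = rng.normal(100, 10, 30)
group2 = rng.normal(105, 10, 30)
group3 = rng.normal(120, 10, 30)
f_stat, p_value = f_oneway(group1, group2, group3)
print(f_stat, p_value)
```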

Mon 2 Oct

For robust model evaluation, I used K-fold cross-validation. I am aware that the bootstrapping method exists for this purpose as well; however, in terms of performance estimation, I cannot perceive a significant difference between them.

 From my understanding,

Cross validation and bootstrapping are both resampling techniques.

Bootstrap resamples with replacement. A bootstrapped data set may contain numerous occurrences of the same original cases and may completely miss other original cases, due to the drawing with replacement.

Cross validation resamples without replacement, resulting in smaller surrogate data sets than the original. These data sets are created in such a way that after a certain number k of surrogate data sets, each of the n original cases is left out exactly once. This is referred to as k-fold cross validation or leave-x-out cross validation with x=n/k, whereas leave-one-out cross validation omits one instance for each surrogate set, i.e. k=n.

The fundamental goal of cross validation, as the name implies, is to measure the performance of a model.

Bootstrapping, on the other hand, is typically used to estimate the empirical distribution of a wide range of statistics.

In practice, there is generally no difference between iterated k-fold cross validation and the bootstrap. The total error has been found to be similar for a similar total number of examined surrogate models, while the bootstrap typically has higher bias and lower variance than the corresponding CV estimates.
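
A sketch comparing the two estimators on the same model, using synthetic data (the 5 folds, 100 bootstrap repetitions and out-of-bag evaluation are my choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# K-fold cross validation: resampling without replacement, each case held out exactly once.
cv_scores = cross_val_score(model, X, y, cv=5)
print("5-fold CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Bootstrap: resample with replacement, evaluate on the out-of-bag (never drawn) cases.
boot_scores = []
rng = np.random.default_rng(0)
n = len(X)
for _ in range(100):
    idx = rng.integers(0, n, n)               # indices drawn with replacement
    oob = np.setdiff1d(np.arange(n), idx)     # cases never drawn in this resample
    model.fit(X[idx], y[idx])
    boot_scores.append(accuracy_score(y[oob], model.predict(X[oob])))
print("Bootstrap OOB accuracy: %.3f +/- %.3f" % (np.mean(boot_scores), np.std(boot_scores)))
```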