Wed 15 Nov

To analyse the relationships between variables in a dataset of economic indicators, I calculated the correlation coefficients and visualised them with a heatmap.
First, I computed a correlation matrix, which details the pairwise correlations among all numeric columns in the DataFrame. Each cell in the matrix holds the correlation coefficient between two variables, signifying the strength and direction of their linear relationship. Correlation coefficients range from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear relationship.
Using the sns.heatmap() function, I then generated a heatmap from the correlation matrix. The heatmap offers a visual representation of the correlations, employing a colour spectrum to depict the strength of each relationship, with annotations in each cell displaying the coefficient values to aid interpretation.
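A minimal sketch of this step, assuming the indicators live in a pandas DataFrame (the file name is hypothetical):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('economic_indicators.csv')  # hypothetical file name

# Pairwise Pearson correlations between all numeric columns.
corr_matrix = df.corr(numeric_only=True)

# Annotated heatmap: colour encodes strength/direction, numbers show the coefficients.
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation matrix of economic indicators')
plt.tight_layout()
plt.show()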
The resulting heatmap served as a powerful visualisation tool, enabling quick identification of strong positive or negative correlations (values closer to 1 or -1) and weak correlations (values closer to 0). This visual representation helped me understand complex relationships between the economic indicators, supporting decisions such as feature selection, identifying multicollinearity, and guiding further analysis and modelling.

Mon 13 Nov

I have chosen the economic indicators dataset.
I first loaded the dataset to understand its structure and content. This step is crucial, as it offers an initial view of the data before further analysis.
I then visualised trends in the economic indicators over time using line plots, utilising Matplotlib to show multiple indicators in a single figure. The code iterates through the chosen indicators and plots each against time (represented by ‘Year-Month’), aiding the identification of potential correlations or patterns.
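A minimal sketch of the plotting loop, assuming a ‘Year-Month’ column as described (the file name and indicator column names are hypothetical):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('economic_indicators.csv')   # hypothetical file name
indicators = ['unemp_rate', 'total_jobs']     # hypothetical column names

plt.figure(figsize=(10, 5))
for col in indicators:
    # One line per indicator, plotted against time.
    plt.plot(df['Year-Month'], df[col], label=col)
plt.xlabel('Year-Month')
plt.ylabel('Value')
plt.legend()
plt.title('Economic indicators over time')
plt.show()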
Then I calculated summary statistics for the economic indicators using Pandas’ describe() and individual statistical functions (e.g., mean(), median(), std()). These statistics provided an overall understanding of the dataset’s central tendencies and variability across the different economic indicators.
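Continuing with the same DataFrame as in the sketch above, the summaries can be produced along these lines (the column name is hypothetical):

# Count, mean, std, min, quartiles and max for every numeric column.
print(df.describe())

# Individual statistics for a single indicator.
print(df['unemp_rate'].mean())
print(df['unemp_rate'].median())
print(df['unemp_rate'].std())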

Fri 10 Nov

Having performed the DBSCAN algorithm, I implemented OPTICS, which extends DBSCAN.
Preprocessed the data by normalising/standardising it and converting the object values to numerical ones.
Used the Haversine distance formula to calculate the distance between each pair of locations.
Later applied the OPTICS algorithm:
from sklearn.cluster import OPTICS

optics = OPTICS(min_samples=5, metric='precomputed')
optics.fit(haversine_matrix)
Identified the noise points and extracted cluster information such as core samples and reachability distances.
Also visualised the clusters on a map to understand their spatial distribution.
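A minimal end-to-end sketch of this pipeline, assuming the coordinates sit in ‘latitude’/‘longitude’ columns (the file and column names are hypothetical):

import numpy as np
import pandas as pd
from sklearn.cluster import OPTICS
from sklearn.metrics.pairwise import haversine_distances

df = pd.read_csv('locations.csv')  # hypothetical file name

# haversine_distances expects [lat, lon] in radians and returns distances in radians;
# multiplying by the Earth's radius (~6371 km) converts them to kilometres.
coords = np.radians(df[['latitude', 'longitude']].to_numpy())
haversine_matrix = haversine_distances(coords) * 6371

optics = OPTICS(min_samples=5, metric='precomputed')
optics.fit(haversine_matrix)

labels = optics.labels_                # -1 marks noise points
reachability = optics.reachability_    # reachability distance per sample
print('clusters:', set(labels) - {-1}, '| noise points:', (labels == -1).sum())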

Mon 6 Nov

Hierarchical clustering is a valuable technique for analysing geospatial data that includes latitude and longitude variables.
Since our data has these, I measured the dissimilarity between locations using a distance metric such as Euclidean distance, and computed a pairwise distance matrix representing the differences between all pairs of locations. I then applied an agglomerative hierarchical clustering algorithm, together with a linkage method of choice, to the distance matrix. The outcome was a dendrogram, visually displaying the hierarchical structure of the clusters.
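A minimal sketch of this workflow with SciPy, assuming ‘latitude’/‘longitude’ columns (file and column names hypothetical; Ward linkage is just one possible choice):

import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

df = pd.read_csv('locations.csv')  # hypothetical file name

# Condensed pairwise Euclidean distance matrix between all locations.
# Note: Euclidean distance on raw lat/lon ignores the Earth's curvature.
dist = pdist(df[['latitude', 'longitude']], metric='euclidean')

# Agglomerative clustering with a chosen linkage method.
Z = linkage(dist, method='ward')

# Dendrogram showing the hierarchical structure of the clusters.
dendrogram(Z)
plt.xlabel('Location index')
plt.ylabel('Merge distance')
plt.show()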
Finally, I interpreted and analysed the spatial patterns within the identified clusters and investigated the geospatial implications of the clustering.

Fri 3 Nov

K-Means and DBSCAN are two clustering algorithms:

K-Means:
Partition-based clustering.
Requires the number of clusters (K) to be specified beforehand.
Assigns data points to the nearest cluster centroid.
Sensitive to initial centroid placement.
Performs hard clustering (each point belongs to one cluster).
Assumes spherical clusters.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
Density-based clustering.
Automatically finds clusters based on data density, no need to specify K.
Identifies clusters of arbitrary shapes.
Handles noise/outliers.
Accommodates variable cluster sizes.
Primarily performs hard clustering, but soft clustering can be achieved using extensions.

For the project I used the DBSCAN clustering algorithm.
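A minimal sketch contrasting the two interfaces in scikit-learn (synthetic two-moons data; the parameter values are illustrative):

import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# K-Means: K must be chosen up front, and every point is assigned to a cluster.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN: no K; clusters follow density, and the label -1 marks noise/outliers.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print('K-Means clusters:', np.unique(kmeans_labels))
print('DBSCAN clusters (incl. -1 for noise):', np.unique(dbscan_labels))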

Wed 1 Nov

Logistic regression is a statistical modelling technique used to analyse the relationship between a binary outcome variable and one or more predictor variables.
I have taken “manner_of_death” as the binary outcome variable and the other columns as predictor variables. The “manner_of_death” column indicates whether the death was “shot” or “shot and tasered,” and other columns, like “armed,” “age,” “gender,” and “race,” may be used as predictor variables to estimate the risk of a particular mode of death.
I then explored the data and preprocessed it by handling missing values and encoding the categorical variables. Next, I built a logistic regression model by fitting it to the data, with the binary outcome as the dependent variable and the remaining predictor columns as the independent variables.
Evaluated the model’s performance using accuracy, precision, recall, F1 score and R2 score, which were all satisfactory.
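A minimal sketch of this pipeline, using the column names from the notes (the file name and the exact outcome label are hypothetical and should match the data):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

df = pd.read_csv('shootings.csv')  # hypothetical file name

cols = ['armed', 'age', 'gender', 'race']
data = df[cols + ['manner_of_death']].dropna()  # simplest missing-value handling

# Binary outcome; the label string must match the dataset's actual values.
y = (data['manner_of_death'] == 'shot and tasered').astype(int)
X = pd.get_dummies(data[cols])  # one-hot encode the categorical predictors

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

pred = model.predict(X_test)
print('accuracy:', accuracy_score(y_test, pred))
print('precision:', precision_score(y_test, pred))
print('recall:', recall_score(y_test, pred))
print('F1:', f1_score(y_test, pred))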

Mon 30 Oct

Applied a decision tree method (Random Forest) to the data, using latitude and longitude as features and the target value (id) to predict.
Then calculated the distance between each pair of points.
Later split the data into training and test sets to evaluate the model’s performance on unseen data.
Assessed the spatial autocorrelation in the data to understand whether there are any spatial patterns that need to be considered in the modelling process.
Then adjusted the number of trees, tree depth and other hyperparameters, and trained the model on the training dataset.
The model learns to capture the spatial and non-spatial patterns in the data.
The accuracy of the model was 67%.
Applied the trained model to the test data and created spatial maps of the predicted outcomes.
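A minimal sketch of the model-fitting step, with features and target as described above (the file name and the hyperparameter values are hypothetical):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('shootings.csv')  # hypothetical file name

X = df[['latitude', 'longitude']].dropna()
y = df.loc[X.index, 'id']  # target column taken from the notes above

# Hold out a test set to evaluate performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Key hyperparameters: number of trees and maximum tree depth.
model = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)
model.fit(X_train, y_train)

print('accuracy:', accuracy_score(y_test, model.predict(X_test)))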

Fri 27 Oct

Having performed both K-means clustering and the DBSCAN algorithm on the data provided, I noticed that the two operate on different principles and have distinct characteristics.

K-means is a partitioning technique that divides data points into K clusters by minimising the squared distance between each point and its cluster centre. DBSCAN is a density-based method that clusters nearby data points in high-density regions while detecting points in low-density areas as outliers.
K-means assumes spherical, roughly equal-sized clusters and requires the number of clusters to be specified before running the algorithm, so it may not perform effectively when clusters have different densities, sizes or irregular shapes. DBSCAN is more resilient with irregularly shaped clusters, since it can recognise clusters of any shape and automatically determines the number of clusters from the data’s density structure; there is no set cluster shape that it must adhere to.

The DBSCAN algorithm efficiently recognises and classifies outliers as noise. It works well with our dataset because it is robust to noise and does not force noise points into clusters, whereas K-means does not specifically address noise or outliers.
K-means is appropriate for large datasets with modest dimensionality and can be computationally efficient, though it may be influenced by the initial cluster centres selected. DBSCAN is less sensitive to initialisation, although a single density threshold (eps) can struggle when clusters have very different densities, and it becomes less effective for high-dimensional data, as its complexity grows with dataset size and dimensionality.