Wed 6 Dec

Implemented Simple Exponential Smoothing (SES) using the ‘statsmodels’ library for time series forecasting of passenger traffic (‘logan_passengers’ column) in the economic_indicators dataset.
Utilized SES, an exponential smoothing method, to predict future passenger counts based on historical data.
Fitted the SES model to the ‘logan_passengers’ data.
The ‘fit’ method estimated the model’s parameters based on the given observations.
After the model was trained, the ‘forecast’ method predicted the passenger counts for the subsequent time points.
SES is a basic yet effective method for forecasting time series data by assigning exponentially decreasing weights to past observations, assuming no underlying trend or seasonality.
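A minimal sketch of this SES workflow, assuming the data is already loaded from a CSV into a pandas DataFrame (the file name and the 12-step forecast horizon are illustrative assumptions):

```python
# Minimal sketch of the SES workflow described above; the CSV file name
# and the 12-step forecast horizon are illustrative assumptions.
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

df = pd.read_csv("economic_indicators.csv")      # hypothetical file name
series = df["logan_passengers"]

ses_fit = SimpleExpSmoothing(series, initialization_method="estimated").fit()
ses_forecast = ses_fit.forecast(12)              # predict the next 12 time points
print(ses_forecast)
```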
After fitting both the SES and ARIMA models, the comparison between them is that SES is a basic and straightforward model suitable for simple time series data without trends or seasonality. It’s easy to implement but lacks the ability to capture complex patterns. ARIMA, on the other hand, is a more comprehensive model that can handle various time series patterns by considering trends, seasonality, and irregularities. It offers greater flexibility but requires careful parameter selection and is computationally more intensive.

Mon 4 Dec

Used AR and LLR time series models to understand and predict future values based on past observations.
First, visualized the data using plots such as line plots, histograms, and autocorrelation plots to understand its patterns, trends, seasonality, and stationarity.
Then checked for stationarity in the data using a statistical test, the Augmented Dickey-Fuller (ADF) test.
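A short sketch of the ADF check, assuming the series under study is held in a pandas Series called series (an illustrative name, not one from the notebook):

```python
# Hedged sketch of the ADF stationarity check; `series` is an assumed
# pandas Series holding the time series under study.
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, *_ = adfuller(series.dropna())
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")
# A p-value below 0.05 suggests stationarity; otherwise differencing
# (e.g., series.diff().dropna()) is a common next step.
```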
Also determined the appropriate order (p) for the AR model by analyzing autocorrelation and partial autocorrelation functions (ACF and PACF).
ACF helps to identify the order of the moving average (MA) part, while PACF indicates the order of the autoregressive (AR) part.
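The ACF/PACF inspection can be sketched as follows (again assuming the data is in series; the 24-lag window is an arbitrary choice):

```python
# Sketch of the ACF/PACF plots used to choose the AR order (p).
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(series, lags=24, ax=axes[0])    # informs the MA order (q)
plot_pacf(series, lags=24, ax=axes[1])   # informs the AR order (p)
plt.tight_layout()
plt.show()
```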
Then fitted the AR model using the determined order (p) on the preprocessed and stationary time series data.
Evaluated the AR model using Mean Squared Error (MSE) as the performance metric.
Used residual analysis to check for any patterns or autocorrelation in the model residuals.
Once the model was validated, I used it to make future predictions or forecast values based on the trained model.
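The fitting and evaluation steps, sketched with statsmodels’ AutoReg; the lag order and the 12-point hold-out are illustrative assumptions rather than the values actually used:

```python
# Hedged sketch of fitting an AR model, scoring it with MSE, and
# inspecting residuals; lags=3 and the 12-point test set are assumptions.
from statsmodels.tsa.ar_model import AutoReg
from sklearn.metrics import mean_squared_error

train, test = series[:-12], series[-12:]

ar_fit = AutoReg(train, lags=3).fit()
preds = ar_fit.predict(start=len(train), end=len(train) + len(test) - 1)
print("MSE:", mean_squared_error(test, preds))

residuals = ar_fit.resid   # inspect for leftover patterns or autocorrelation
# extending `end` beyond the test range yields further out-of-sample forecasts
```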

Wed 29 Nov

Performed time series analysis on the ‘economic indicators’ data to understand underlying patterns such as trends, seasonality, and cyclic behaviours.
The data was organised chronologically by time variable (‘Year’ and ‘Month’ columns).
Later, to represent time as a single index, I combined the ‘Year’ and ‘Month’ columns into one datetime column.
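A minimal sketch of that step, assuming the data sits in a DataFrame df and ‘Month’ holds numeric month values:

```python
# Combine 'Year' and 'Month' into a datetime index; the DataFrame name
# and the assumption of numeric months are illustrative.
import pandas as pd

month_str = df["Month"].astype(str).str.zfill(2)
df["Date"] = pd.to_datetime(df["Year"].astype(str) + "-" + month_str + "-01")
df = df.set_index("Date").sort_index()
```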
Also visualised the time series data using time series plots to observe patterns, trends, and seasonality in each economic indicator over time.
Then, to analyse the specific contributions, I decomposed the time series into its components (trend, seasonality, residual) using techniques such as seasonal decomposition.
Seasonal decomposition is a time series analysis approach that divides a time series dataset into three components: trend, seasonality, and residual or error components.
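A sketch of the decomposition for one indicator, assuming a monthly, datetime-indexed DataFrame df; the column name follows this log and the additive model is an assumption:

```python
# Decompose one indicator into trend, seasonal, and residual components.
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(df["logan_passengers"], model="additive", period=12)
decomposition.plot()   # panels for observed, trend, seasonal, and residual
```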
Applied a forecasting model, ARIMA (AutoRegressive Integrated Moving Average), to predict future values of the economic indicators.
Then I split the data into training and test sets to train the model and evaluate its performance.
Used Mean Absolute Error (MAE) to assess the accuracy of the forecasting model and compared the predicted values against the actual values in the test dataset to evaluate the model’s performance.
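Sketched below with an illustrative (1, 1, 1) order and a 12-month hold-out; neither is necessarily what was used here:

```python
# Hedged sketch of the ARIMA forecast and MAE evaluation.
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error

series = df["logan_passengers"]
train, test = series[:-12], series[-12:]

arima_fit = ARIMA(train, order=(1, 1, 1)).fit()
forecast = arima_fit.forecast(steps=len(test))
print("MAE:", mean_absolute_error(test, forecast))
```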

Mon 27 Nov

Performed clustering analysis on the “economic indicators” dataset, which involved grouping similar instances together based on the data’s behaviour or trends.
First checked for missing or null values, of which there were none.
Then performed feature scaling, applying min-max scaling and standardisation to the columns of the data.
As for feature selection, removed the ‘year’ and ‘month’ columns, as they were not contributing significantly to the clustering process.
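A minimal sketch of this preprocessing, assuming the data is in a DataFrame df with ‘Year’ and ‘Month’ columns:

```python
# Drop the time columns and scale the remaining features; the DataFrame
# name and column names follow this log and are otherwise assumptions.
from sklearn.preprocessing import MinMaxScaler, StandardScaler

features = df.drop(columns=["Year", "Month"])

minmax_scaled = MinMaxScaler().fit_transform(features)    # rescale to [0, 1]
standardised = StandardScaler().fit_transform(features)   # zero mean, unit variance
```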
Later plotted an elbow curve to find the optimal number of clusters and applied the k-means clustering algorithm to the data with that number of clusters.
For the elbow curve, used the within-cluster sum of squares (WCSS) to measure the performance of the clustering and to determine the optimal number of clusters for the given dataset.
The goal of K-means clustering is to divide the dataset into a predefined number of groups (K), with each data point belonging to the cluster with the closest mean (centroid). The sum of squared distances between each data point and the centroid of the cluster to which it is assigned is calculated by WCSS.
Performed the k-means clustering with the objective of reducing the WCSS. The WCSS tends to drop as the number of clusters grows because each cluster has fewer data points, lowering the distance between data points and their centroids.
The point at which the reduction in WCSS begins to level off (creating an “elbow”) denotes an ideal number of clusters.
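A sketch of the elbow computation and the final fit, assuming the scaled feature matrix standardised from the step above and up to 10 candidate cluster counts:

```python
# Compute WCSS (KMeans inertia_) over candidate k values, plot the elbow,
# then fit the final model; the k range and random_state are assumptions.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(standardised)
    wcss.append(km.inertia_)                 # inertia_ is the WCSS for this k

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS")
plt.show()

final_km = KMeans(n_clusters=6, n_init=10, random_state=42).fit(standardised)
labels = final_km.labels_                    # cluster assignment for each row
```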
From the plot, the elbow point lies where the WCSS is between 400 and 600, as the WCSS values start to decrease at a slower rate beyond that point.
Therefore, the optimal number of clusters for the k-means algorithm on this dataset is 6.

Fri 24 Nov

On the economic indicators dataset, performed a hypothesis test, the Pearson correlation test, between hotel occupancy rates and housing prices to determine the significance of this association by calculating the Pearson correlation coefficient and the associated p-value.
The Pearson correlation coefficient measures the linear relationship between two variables, ranging from -1 (a perfect negative linear relationship) to 1 (a perfect positive linear relationship), with 0 indicating no linear correlation. The obtained p-value indicated the statistical significance of the correlation coefficient. Typically, a p-value below a specified threshold, such as 0.05, indicates a significant relationship.
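A minimal sketch of the test with scipy; the column names below are hypothetical placeholders for the hotel occupancy and housing price columns, not the actual names in the dataset:

```python
# Hedged sketch of the Pearson correlation test; column names are
# hypothetical placeholders for the real ones in the dataset.
from scipy.stats import pearsonr

r, p_value = pearsonr(df["hotel_occupancy_rate"], df["housing_price"])
print(f"Pearson r = {r:.3f}, p-value = {p_value:.4f}")
# A p-value below 0.05 would point to a statistically significant linear relationship.
```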
Later interpreted the results by determining whether hotel occupancy rates and housing prices had a significant correlation. A significant correlation implied that changes in hotel occupancy might relate to changes in housing prices.
Then adjusted the significance level based on the dataset’s characteristics to further refine the analysis.

Wed 22 Nov

Performed regression analysis on the economic indicators dataset to predict an outcome variable using selected features and also assessed how well the model explained the variance within the data.
First preprocessed the data and utilised the features from the feature selection process and defined the target outcome variable.
Then divided the dataset into training and testing sets using train_test_split from sklearn.model_selection. Then initialized and trained the model using the training data. Used the test data to make predictions and calculated the R-squared value using r2_score from sklearn.metrics.
The R-squared value quantified the proportion of variance in the target variable explained by the model. A higher R-squared value, closer to 1, indicated a better fit of the model to the data, implying more accurate predictions.
Later examined the coefficients of the model to understand each feature’s impact on the predicted outcome.
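A minimal sketch of this workflow; X and y stand for the selected features and target from the earlier step, and the linear regression estimator and 80/20 split are assumptions since the log does not name them:

```python
# Hedged sketch: train/test split, fit, R-squared, and coefficients.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("R-squared:", r2_score(y_test, y_pred))
print("Coefficients:", dict(zip(X.columns, model.coef_)))   # per-feature impact
```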

Mon 20 Nov

In the economic indicators dataset there are a lot of variables, so I performed feature selection, which is a critical step in refining datasets for modelling or analysis. Correlation analysis was used to identify redundant or highly correlated features, indicating potential multicollinearity.
I calculated the correlation coefficients between features, set a threshold (e.g., 0.7 or 0.8), and identified the pairs with correlations above it. From each of these pairs, only one feature was retained.
Additionally, addressing multicollinearity is vital. Therefore, I used Variance Inflation Factor (VIF) to assess how much the variance of a feature is inflated by correlations with other features. Normally, high VIF scores (> 5 or 10) signify multicollinearity.
Of the features with high VIF scores, a few were dropped and others were combined into composite variables to mitigate multicollinearity.
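A sketch of the correlation-threshold and VIF checks, assuming the candidate features are in a DataFrame named features and an example threshold of 0.8:

```python
# Hedged sketch of the correlation and VIF screening described above.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

corr = features.corr().abs()
high_pairs = [(a, b) for a in corr.columns for b in corr.columns
              if a < b and corr.loc[a, b] > 0.8]   # highly correlated pairs

vif = pd.Series(
    [variance_inflation_factor(features.values, i) for i in range(features.shape[1])],
    index=features.columns,
)
print(high_pairs)
print(vif.sort_values(ascending=False))   # VIF > 5-10 suggests multicollinearity
```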
Next, I will use this refined set of features for modelling and evaluate the model’s performance using the same.

Fri 17 Nov

After analysing the “food establishment inspections” dataset, it was determined that the decision tree technique would work best because it can be used for both regression and classification, can handle both numerical and categorical data, implicitly performs feature selection, and is robust to outliers and missing values.
They use a structure akin to a tree to represent decisions and their possible outcomes. The nodes in the tree stand for features, the branches for decision rules, and the leaves for the result or goal variable.
Decision trees determine the optimal feature to split the dataset at each node based on a variety of criteria, such as information gain, entropy, and the Gini index.
The algorithm recursively splits the dataset based on the selected features until a stopping criterion is met. This could be a maximum depth limit, minimum samples at a node, or others to prevent overfitting.
After the tree is constructed, each new instance moves through the tree according to the feature values until it reaches a leaf node, which yields the expected result.
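A minimal sketch of such a classifier with scikit-learn; X, y, and the stopping parameters below are illustrative assumptions, not the actual setup for this dataset:

```python
# Hedged sketch of a decision tree classifier with explicit stopping criteria.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(
    criterion="gini",        # split quality measured by the Gini index
    max_depth=5,             # maximum depth limit to curb overfitting
    min_samples_leaf=10,     # minimum samples required at a leaf node
).fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```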
Next, I’ll analyse a few more datasets and build models accordingly.