Using Unsupervised Machine Learning to Discover Unknown Patterns
This project consists of four technical analysis deliverables:
Deliverable 1: Preprocessing the Data for PCA
Deliverable 2: Reducing Data Dimensions Using PCA
Deliverable 3: Clustering Cryptocurrencies Using K-means
Deliverable 4: Visualizing Cryptocurrencies Results
Stakeholders are interested in offering a new cryptocurrency investment portfolio to their customers. The company, however, is lost in the vast universe of cryptocurrencies. We’ll create a report that covers which cryptocurrencies are on the trading market and how they could be grouped to create a classification system for this new investment.
The data we will be working with is not ideal, so it will need to be processed to fit the machine learning models. Since there is no known output for what we are looking for, we will use unsupervised learning. To group the cryptocurrencies, we decided on a clustering algorithm. We’ll use data visualizations to share our findings.
Data source:
Software:
Using Pandas, we’ll preprocess the dataset in order to perform PCA in Deliverable 2.
Using the Principal Component Analysis (PCA) algorithm, we’ll reduce the dimensions of the X DataFrame to three principal components and place these dimensions in a new DataFrame.
Using the K-means algorithm, we’ll create an elbow curve using hvPlot to find the best value for K from the pcs_df DataFrame created in Deliverable 2. Then, we’ll run the K-means algorithm to predict the K clusters for the cryptocurrencies’ data.
Using our knowledge of creating scatter plots with Plotly Express and hvplot, we’ll visualize the distinct groups that correspond to the three principal components we created in Deliverable 2. Then we’ll create a table with all the currently tradable cryptocurrencies using the hvplot.table() function.
The following five preprocessing steps have been performed on the crypto_df DataFrame:
All cryptocurrencies that are not being traded are removed
The IsTrading column is dropped
All the rows that have at least one null value are removed
All the rows that do not have coins being mined are removed
The CoinName column is dropped
A new DataFrame is created that stores all cryptocurrency names from the CoinName column and retains the index from the crypto_df DataFrame
The get_dummies() method is used to create variables for the text features, which are then stored in a new DataFrame, X
The features from the X DataFrame have been standardized using the StandardScaler fit_transform() function
The final DataFrame is shown below in Figure 1.1.
Figure (1.1) X_scaled DataFrame: the features of the X DataFrame after standardization with the StandardScaler fit_transform() function.
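The preprocessing steps listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the sample rows are made up, and the column names `TotalCoinsMined`, `TotalCoinSupply`, and `ProofType` are assumptions about the dataset (only `CoinName`, `IsTrading`, and `Algorithm` are named in this README). The name `coin_names_df` is also hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up sample standing in for the real dataset (all values are illustrative).
# TotalCoinsMined, TotalCoinSupply and ProofType are assumed column names.
crypto_df = pd.DataFrame(
    {
        "CoinName": ["AlphaCoin", "BetaCoin", "GammaCoin", "DeltaCoin", "EpsilonCoin", "ZetaCoin"],
        "Algorithm": ["SHA-256", "Scrypt", "X11", "Scrypt", "SHA-256", "X11"],
        "IsTrading": [True, True, False, True, True, True],
        "ProofType": ["PoW", "PoW/PoS", "PoW", "PoS", "PoW", "PoS"],
        "TotalCoinsMined": [1000.0, 500.0, 200.0, None, 0.0, 750.0],
        "TotalCoinSupply": [2100.0, 840.0, 220.0, 1000.0, 500.0, 900.0],
    },
    index=["ALP", "BET", "GAM", "DEL", "EPS", "ZET"],
)

# 1. Keep only the cryptocurrencies that are being traded.
crypto_df = crypto_df[crypto_df["IsTrading"]]

# 2. Drop the IsTrading column, which is now constant.
crypto_df = crypto_df.drop(columns=["IsTrading"])

# 3. Remove every row that has at least one null value.
crypto_df = crypto_df.dropna()

# 4. Keep only the rows where coins have actually been mined.
crypto_df = crypto_df[crypto_df["TotalCoinsMined"] > 0]

# Save the coin names (same index as crypto_df) before dropping the column.
coin_names_df = crypto_df[["CoinName"]].copy()

# 5. Drop the CoinName column, since it is not a feature for clustering.
crypto_df = crypto_df.drop(columns=["CoinName"])

# One-hot encode the text features into the feature DataFrame X.
X = pd.get_dummies(crypto_df, columns=["Algorithm", "ProofType"], dtype=float)

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

print(X.columns.tolist())
print(X_scaled.shape)
```

Note that the coin names have to be saved before the CoinName column is dropped, so the sketch creates `coin_names_df` one step earlier than the order in the list suggests.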
The final DataFrame is shown below in Figure 1.2.
Figure (1.2) X_pca_df DataFrame
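As a rough sketch of the dimensionality reduction step, the snippet below fits scikit-learn's `PCA` with three components and wraps the result in a DataFrame. The input here is random data standing in for the scaled features, and the index labels and the column names `PC 1`, `PC 2`, and `PC 3` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Random stand-in for the standardized feature matrix (20 coins, 8 features).
rng = np.random.default_rng(0)
index = [f"COIN{i}" for i in range(20)]  # hypothetical index labels
X_scaled = rng.normal(size=(20, 8))

# Reduce the feature space to three principal components.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Keep the original index so each row can be matched back to its coin.
pcs_df = pd.DataFrame(X_pca, columns=["PC 1", "PC 2", "PC 3"], index=index)

print(pcs_df.head())
# Fraction of the total variance captured by each component.
print(pca.explained_variance_ratio_)
```

Checking `explained_variance_ratio_` is a quick way to see how much information the three components retain before clustering on them.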
The K-means algorithm is used to cluster the cryptocurrencies using the PCA data, where the following steps have been completed:
Figure (1.3) Elbow curve
Figure (1.4) K-Means Algorithm: used to cluster the cryptocurrencies.
Figure (1.5) Clustered_df DataFrame.
Figure (1.6) 3D Scatter plot
Figure (1.7) 3D Scatter plot with CoinName and Algorithm on hover
Figure (1.8) hvplot table
Figure (1.9) Total number of tradable cryptocurrencies
Figure (1.10) DataFrame that has the scaled data with the clustered_df DataFrame index.
Figure (1.11) hvplot scatter plot
In this project, we worked primarily with the K-means algorithm, a widely used unsupervised algorithm that groups similar data into clusters. We built on this by speeding up the process with principal component analysis (PCA), which condenses many different features into a smaller number of dimensions.
Using the K-means algorithm, we then created an elbow curve with hvPlot to find the best value for K, and ran K-means with that value to predict the clusters for the cryptocurrencies’ data.
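The elbow-curve and clustering steps might look something like the sketch below. It uses three synthetic, well-separated blobs in place of the real principal components, so the chosen `best_k = 3` applies only to this toy data. The column name `Class` and the variable names `elbow_df` and `best_k` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for pcs_df: three well-separated blobs in 3-D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [6.0, 6.0, 6.0], [-6.0, 6.0, -6.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 3)) for c in centers])
pcs_df = pd.DataFrame(points, columns=["PC 1", "PC 2", "PC 3"])

# Fit K-means for a range of k values and record the inertia
# (sum of squared distances of samples to their closest centroid).
k_values = list(range(1, 11))
inertia = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(pcs_df)
    inertia.append(model.inertia_)

elbow_df = pd.DataFrame({"k": k_values, "inertia": inertia})
print(elbow_df)

# With hvPlot installed (import hvplot.pandas), the curve could be drawn with:
#   elbow_df.hvplot.line(x="k", y="inertia", xticks=k_values, title="Elbow Curve")

# For this toy data the bend is at k = 3; read the value off your own curve.
best_k = 3
model = KMeans(n_clusters=best_k, n_init=10, random_state=0)
predictions = model.fit_predict(pcs_df)

# Attach the predicted cluster label to each row.
clustered_df = pcs_df.copy()
clustered_df["Class"] = predictions
print(clustered_df["Class"].value_counts())
```

The elbow is the point where adding another cluster stops reducing inertia substantially; it is a visual heuristic, so reasonable readers may pick slightly different values on noisier data.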
Finally, we created scatter plots with Plotly Express and hvplot to visualize the distinct groups that correspond to the three principal components, and built a table of all the currently tradable cryptocurrencies using the hvplot.table() function.
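A possible shape for the plotting code is sketched below. The sample rows are invented, the column names `PC 1`, `PC 2`, `PC 3`, and `Class` are assumptions, and the helper functions `make_3d_scatter` and `make_table` are hypothetical names. The plotting libraries are imported inside the helpers so the data preparation still runs where Plotly or hvPlot is not installed.

```python
import pandas as pd

# Invented rows standing in for the clustered DataFrame.
clustered_df = pd.DataFrame(
    {
        "CoinName": ["AlphaCoin", "BetaCoin", "GammaCoin", "DeltaCoin"],
        "Algorithm": ["SHA-256", "Scrypt", "X11", "Scrypt"],
        "PC 1": [-0.4, 1.2, -0.3, 2.5],
        "PC 2": [0.8, -1.1, 0.9, 0.1],
        "PC 3": [-0.2, 0.3, -0.1, 1.7],
        "Class": [0, 1, 0, 2],
    },
    index=["ALP", "BET", "GAM", "DEL"],
)


def make_3d_scatter(df):
    """Return a 3D scatter of the principal components, colored by cluster."""
    import plotly.express as px  # lazy import: only needed when plotting

    # Cast the cluster label to str so Plotly uses discrete colors.
    plot_df = df.astype({"Class": str})
    return px.scatter_3d(
        plot_df,
        x="PC 1",
        y="PC 2",
        z="PC 3",
        color="Class",
        symbol="Class",
        hover_name="CoinName",     # shown as the tooltip title
        hover_data=["Algorithm"],  # extra field shown on hover
    )


def make_table(df):
    """Return a sortable hvPlot table of the tradable cryptocurrencies."""
    import hvplot.pandas  # noqa: F401  (registers the .hvplot accessor)

    return df.hvplot.table(
        columns=["CoinName", "Algorithm", "Class"],
        sortable=True,
        selectable=True,
    )


# The row count gives the total number of tradable cryptocurrencies.
print(f"There are {len(clustered_df)} tradable cryptocurrencies.")
```

In a notebook, calling `make_3d_scatter(clustered_df).show()` would render the interactive figure, and evaluating `make_table(clustered_df)` in a cell would display the table.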
The ultimate goal of these visualizations is to present the data as an interactive, easy-to-understand story that gives stakeholders the correct information to support their decision-making.