I'm unable to understand this question. Kindly help.
Iesha Azarov is a senior analyst at Ganymede Moon Partners (Ganymede), where he works with junior analyst Pàola Bector. Azarov would like to incorporate machine learning (ML) models into the company’s analytical process. Azarov asks Bector to develop ML models for two unstructured stock sentiment datasets, Dataset ABC and Dataset XYZ. Both datasets have been cleaned and preprocessed in preparation for text exploration and model training.
Following an exploratory data analysis that revealed Dataset ABC’s most frequent tokens, Bector conducts a collection frequency analysis. Bector then computes TF–IDF (term frequency–inverse document frequency) for several words in the collection and tells Azarov the following:
- Statement 1: IDF is equal to the inverse of the document frequency measure.
- Statement 2: TF at the collection level is multiplied by IDF to calculate TF–IDF.
- Statement 3: TF–IDF values vary by the number of documents in the dataset; therefore, model performance can vary when the model is applied to a dataset with just a few documents.
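For readers checking these statements, here is a minimal Python sketch of one common TF–IDF convention (document frequency as the fraction of sentences containing the word, IDF as the log of its inverse, and TF taken at the sentence level). The mini-corpus is hypothetical, not drawn from Dataset ABC.

```python
import math

# Hypothetical mini-corpus of three tokenized "sentences" (not Dataset ABC).
sentences = [
    ["stock", "price", "up"],
    ["stock", "price", "down"],
    ["earnings", "up"],
]

def tf_idf(word, sentence, corpus):
    """TF-IDF under one common convention:
    TF  = count of word in this sentence / tokens in this sentence
    DF  = fraction of sentences in the corpus containing the word
    IDF = log(1 / DF)
    """
    tf = sentence.count(word) / len(sentence)
    df = sum(word in s for s in corpus) / len(corpus)
    return tf * math.log(1 / df)

# "stock" appears in 2 of 3 sentences, so it earns a low IDF;
# "earnings" appears in only 1 of 3, so it earns a higher IDF.
print(tf_idf("stock", sentences[0], sentences))     # ~0.135
print(tf_idf("earnings", sentences[2], sentences))  # ~0.549
```

Note how the IDF term rewards rare words and penalizes common ones, which is why TF–IDF values shift with the number of documents in the collection.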
Bector notes that Dataset ABC is characterized by the absence of ground truth.
Bector turns his attention to Dataset XYZ, containing 84,000 tokens and 10,000 sentences. Bector chooses an appropriate feature selection method to identify and remove unnecessary tokens from the dataset and then focuses on model training. For performance evaluation purposes, Dataset XYZ is split into a training set, cross-validation (CV) set, and test set. Each of the sentences has already been labeled as either a positive sentiment (Class “1”) or a negative sentiment (Class “0”) sentence. There is an unequal class distribution between the positive sentiment and negative sentiment sentences in Dataset XYZ. Simple random sampling is applied within levels of the sentiment class labels to balance the class distributions within the splits. Bector’s view is that the false positive and false negative evaluation metrics should be given equal weight. Select performance data from the cross-validation set confusion matrices is presented in Exhibit 1:
Exhibit 1
Performance Metrics for Dataset XYZ
| Confusion Matrix | CV Data (threshold p-value) | Precision | Recall | F1 Score | Accuracy |
| --- | --- | --- | --- | --- | --- |
| A | 0.50 | 0.95 | 0.87 | 0.91 | 0.91 |
| B | 0.35 | 0.93 | 0.90 | 0.91 | 0.92 |
| C | 0.65 | 0.86 | 0.97 | 0.92 | 0.91 |
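For reference, the four metrics in Exhibit 1 are standard confusion-matrix ratios. The sketch below defines them in Python; the TP/FP/TN/FN counts passed in are hypothetical, since Exhibit 1 reports only the resulting ratios.

```python
def metrics(tp, fp, tn, fn):
    """Standard confusion-matrix performance metrics."""
    precision = tp / (tp + fp)                # share of predicted positives that are correct
    recall = tp / (tp + fn)                   # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    accuracy = (tp + tn) / (tp + fp + tn + fn)          # share of all calls that are correct
    return precision, recall, f1, accuracy

# Hypothetical counts, just to exercise the formulas.
print(metrics(tp=90, fp=10, tn=85, fn=15))  # (0.9, ~0.857, ~0.878, 0.875)
```

Because Bector wants false positives and false negatives weighted equally, the F1 score, which balances precision against recall, is the natural headline metric among the four.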
Azarov and Bector evaluate the Dataset XYZ performance metrics for Confusion Matrices A, B, and C in Exhibit 1. Azarov says, “For Ganymede’s purposes, we should be most concerned with the cost of Type I errors.”
Azarov requests that Bector apply the ML model to the test dataset for Dataset XYZ, assuming a threshold p-value of 0.65. Exhibit 2 contains a sample of results from the test dataset corpus.
Exhibit 2
10 Sample Results of Test Data for Dataset XYZ
| Sentence # | Actual Sentiment | Target p-Value |
| --- | --- | --- |
| 1 | 1 | 0.75 |
| 2 | 0 | 0.45 |
| 3 | 1 | 0.64 |
| 4 | 1 | 0.81 |
| 5 | 0 | 0.43 |
| 6 | 1 | 0.78 |
| 7 | 0 | 0.59 |
| 8 | 1 | 0.60 |
| 9 | 0 | 0.67 |
| 10 | 0 | 0.54 |
Bector makes the following remarks regarding model training:
- Remark 1: Method selection is governed by such factors as the type of data and the size of data.
- Remark 2: In the performance evaluation stage, model fitting errors, such as bias error and variance error, are used to measure goodness of fit.
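To illustrate Remark 2, here is a minimal Python sketch (synthetic data, not Dataset XYZ) of how comparing training-set error against cross-validation error exposes bias and variance when judging goodness of fit:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data: a noisy sine wave split into training and CV sets.
x_tr = rng.uniform(0, 1, size=15)
y_tr = np.sin(2 * np.pi * x_tr) + rng.normal(scale=0.3, size=15)
x_cv = rng.uniform(0, 1, size=15)
y_cv = np.sin(2 * np.pi * x_cv) + rng.normal(scale=0.3, size=15)

def fit_errors(degree):
    """Fit a polynomial of the given degree on the training set and
    return (training MSE, cross-validation MSE)."""
    coefs = np.polyfit(x_tr, y_tr, degree)
    err_tr = np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2)
    err_cv = np.mean((np.polyval(coefs, x_cv) - y_cv) ** 2)
    return err_tr, err_cv

for degree in (1, 3, 12):
    err_tr, err_cv = fit_errors(degree)
    print(f"degree {degree:>2}: train MSE {err_tr:.3f}, CV MSE {err_cv:.3f}")

# A low-degree fit underfits (bias error): both errors stay high.
# A very high-degree fit overfits (variance error): training error
# typically collapses while CV error grows.
```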
Q. Based on Exhibit 2, the accuracy metric for Dataset XYZ’s test set sample is closest to:
- 0.67.
- 0.70.
- 0.75.
You need the concept of a confusion matrix for this question: classify each sentence using the 0.65 threshold (predict Class "1" when the target p-value is at least 0.65, Class "0" otherwise), then count the true positives (TP) and true negatives (TN) in Exhibit 2.

- TP (actual 1, predicted 1): sentences 1, 4, and 6, so TP = 3.
- TN (actual 0, predicted 0): sentences 2, 5, 7, and 10, so TN = 4.
- The remaining rows are errors: sentence 9 is a false positive (actual 0, but p = 0.67 clears the threshold), and sentences 3 and 8 are false negatives (actual 1, but p-values of 0.64 and 0.60 fall below it).

Hence the accuracy metric = (TP + TN) / total = (3 + 4) / 10 = 0.70.
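As a sanity check, here is a minimal Python sketch that applies the 0.65 threshold to the Exhibit 2 rows and recomputes the accuracy:

```python
# (actual sentiment, target p-value) for the 10 sample sentences in Exhibit 2
samples = [
    (1, 0.75), (0, 0.45), (1, 0.64), (1, 0.81), (0, 0.43),
    (1, 0.78), (0, 0.59), (1, 0.60), (0, 0.67), (0, 0.54),
]

THRESHOLD = 0.65  # predict Class "1" when the p-value is at or above this

tp = sum(1 for actual, p in samples if actual == 1 and p >= THRESHOLD)
tn = sum(1 for actual, p in samples if actual == 0 and p < THRESHOLD)
fp = sum(1 for actual, p in samples if actual == 0 and p >= THRESHOLD)
fn = sum(1 for actual, p in samples if actual == 1 and p < THRESHOLD)

accuracy = (tp + tn) / len(samples)
print(tp, tn, fp, fn)  # 3 4 1 2
print(accuracy)        # 0.7
```

No sample sits exactly at 0.65, so the choice of a strict versus inclusive threshold comparison does not affect the counts here.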