
1 Objective

1.1 Which questions did we try to answer in the process?

This machine learning project tries to identify Persons of Interest from the Enron scandal. The Enron scandal was one of the largest cases of corporate fraud in the history of the United States. Several people were indicted or found guilty of misappropriating large sums of money. People involved in the Enron scandal are called “persons of interest” (POI) in this assignment.

We are building a machine learning algorithm that, given financial data about an employee and data derived from the so-called Enron Corpus, predicts whether that person could be a person of interest.

1.2 How does the dataset we choose help?

Many Persons of Interest were among the main beneficiaries of the fraud and therefore took away huge sums of money, be it as salary, bonus, stock options, or other compensation. The Enron Corpus, moreover, gives an idea of which employees were in frequent contact with one another. From this data source we hope to gain insights into how information about malpractice spread through the company.

1.3 How is machine learning useful in this context?

No single feature about a person (financial or email-related) can give us a clear yes-or-no answer as to whether a person is “of interest”. Machine learning helps us discover patterns in the interplay of the features available to us. It should also let us predict whether a person is “of interest” (meaning: involved in the scandal) when we see new data about persons for whom we do not yet have that label.

1.4 Were there any outliers? How did you handle them?

Of the 146 observations / persons in the data set, there were two outliers:

  • a record “named” “TOTAL”, probably introduced while scraping the financial data off a table from a web page or document. As a record containing the sum of all features, it lay far outside the spectrum for every feature and was easily recognizable.

  • a record “named” “THE TRAVEL AGENCY IN THE PARK”, which I also excluded because it is not a real “person” at all.

I removed both data points from the data set in a hard-coded preprocessing step.
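Concretely, the removal can be sketched as follows (a minimal sketch, assuming the usual data_dict loaded from final_project_dataset.pkl; not the literal poi_id.py code):

import pickle

# Load the project's data set of persons and their features.
with open("final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

# Drop both non-person records identified as outliers above.
for name in ["TOTAL", "THE TRAVEL AGENCY IN THE PARK"]:
    data_dict.pop(name, None)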

Without these outliers, the data set contained 144 observations; of those, 18 (12.5%) were classified as POI and 126 (87.5%) as non-POI.

2 Feature Selection

2.1 What features did you use? What selection process did you use?

As a preprocessing step before selecting any features, I removed (or rather “disabled”) features which had a high fraction of missing (“NA”) values. I set the NA-fraction threshold at 75%, since a threshold of 50% removed too many features and hurt performance. The following features, listed with their relative fractions of NA values, were removed:

loan_advances                0.979167
director_fees                0.895833
restricted_stock_deferred    0.881944
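The NA-fraction filter itself can be sketched like this (a pandas-based sketch, assuming missing values are stored as the string "NaN", as in the project data set):

import pandas as pd

# Persons as rows, features as columns; map the "NaN" strings to real NAs.
df = pd.DataFrame(data_dict).T.replace("NaN", pd.NA)

# Fraction of missing values per feature; reject everything above 75%.
na_fraction = df.isna().mean()
rejected = na_fraction[na_fraction > 0.75].index.tolist()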

Narrowing the features down, I tried both manual feature selection, after an initial explorative analysis of the data, and automatic feature selection using the SelectKBest preprocessor.

Automatic feature selection with SelectKBest can be enabled or disabled using -f True or -f False. The default is -f False (manual feature selection), since that provided the best performance for most of the algorithms, including the one we chose.

For selecting features manually using explorative data analysis, I implemented a helper class which shows GUI windows with POIs and non-POIs highlighted in different colors. An example window is shown below; the window is capable of displaying multiple plots in a tabbed view. It is also possible to control certain display parameters, like the number of histogram bins or the alpha transparency, with a spinner widget.

This GUI can be shown when calling the poi_id.py script with the options -g univariate_analysis and -g bivariate_analysis respectively.

Screenshot of the bivariate analysis GUI

 

From this exploratory analysis I selected the following feature set as most indicative for POI-vs.-non-POI classification:

salary, bonus, deferral_payments, loan_advances (rejected because fraction of 
  NA values is >0.75), expenses, exercised_stock_options, deferred_income, 
  other, rate_poi_to_this_person, rate_this_person_to_poi, 
  rate_shared_receipt_with_poi

2.2 Did you have to do any scaling? Why (not)?

I compared the performance of each algorithm in three configurations: without feature scaling, with MinMaxScaler feature scaling, and using Principal Component Analysis (PCA) without feature scaling.

The script poi_id.py supports manually setting which configuration to run using the options -s on, -s off, or -s pca; using PCA (without scaling) is the default (-s pca) and the best-performing setting for all algorithms that were tested. For a comparison of exact performance, see Algorithms.
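A minimal sketch of how these three configurations could be wired up as a scikit-learn pipeline (the structure is an illustrative assumption, not the literal poi_id.py code; n_components=5 follows the note at the end of the Feature Selection section):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def build_pipeline(scaling="pca"):
    # Mirrors the -s on / -s off / -s pca switch described above.
    steps = []
    if scaling == "on":
        steps.append(("scaler", MinMaxScaler()))
    elif scaling == "pca":
        steps.append(("pca", PCA(n_components=5)))
    steps.append(("clf", KNeighborsClassifier()))
    return Pipeline(steps)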

Why is PCA run without feature scaling? I was made aware by a Udacity forum post that PCA usually underperforms when scaling is applied before it, since PCA derives its axes from the variance of the features, which is distorted when each feature’s values are rescaled to the interval [0, 1].

I tested the hypothesis by re-running the whole simulation, once with PCA using scaling and once with PCA not using scaling. The result was generally improved performance for all algorithms when scaling was not used. Only Support Vector Machines stopped working whenever scaling was disabled.

The exact numbers for PCA-with-scaling vs. PCA-without-scaling are documented in a separate PDF file as an appendix. Frankly, the result that PCA performed better without scaling puzzled me, also because one project reviewer at Udacity commented: “Something to note here, PCA’s process occurs in euclidean space and similarly to algorithms like SVM e.t.c requires scaling to work optimally.”

It should be noted that some algorithms need some sort of scaling to run at all. For example, the Support Vector Classifier, when used with a polynomial kernel among other values in GridSearchCV, did not terminate within 45 minutes of runtime on my machine if scaling was turned off.

2.3 Which features did you try to make / modify and why?

First, I cross-validated the existing total_* features by checking whether the sum of their constituents matched the value present in the data set. For this step, poi_id.py has an option -l <filename> to write the whole data set out to a CSV file, which can then be examined and further analyzed in, e.g., MS Excel.

I found a few records where a total_* feature was set to NA even though some of its summands were present. For these records, I could simply recalculate the total value. For other data points, I had to apply a hard-coded, one-off change to the data set to ensure consistency. The whole data cleaning step is done as a manual preprocessing step. The following images show the process in detail:

Cleaning process of feature total_payments

Cleaning process of feature total_stock_value
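The consistency check itself can be sketched as follows (a sketch under the assumption that total_payments is the sum of the constituents listed in the insider-pay sheet; this is not verbatim poi_id.py code):

# Constituent fields of total_payments (assumed from the insider-pay sheet).
PAYMENT_FIELDS = ["salary", "bonus", "long_term_incentive", "deferred_income",
                  "deferral_payments", "loan_advances", "other", "expenses",
                  "director_fees"]

def as_number(value):
    # The data set stores missing values as the string "NaN".
    return 0 if value == "NaN" else value

for name, person in data_dict.items():
    recomputed = sum(as_number(person[f]) for f in PAYMENT_FIELDS)
    if as_number(person["total_payments"]) != recomputed:
        print(name, person["total_payments"], "!=", recomputed)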

I also created three new features, derived from from_this_person_to_poi, from_poi_to_this_person, and shared_receipt_with_poi:

\(\text{rate_poi_to_this_person} = \frac{\text{from_poi_to_this_person}}{\text{from_messages}}\)

\(\text{rate_this_person_to_poi} = \frac{\text{from_this_person_to_poi}}{\text{to_messages}}\)

\(\text{rate_shared_receipt_with_poi} = \frac{\text{shared_receipt_with_poi}}{\text{to_messages}}\)

The rationale behind creating those features is that some people might write less, some people might write more emails. We really want to look at what is the percentage of emails someone wrote to a person of interest in order to estimate how likely they were to be involved in malpractices of other persons of interest.
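Following the formulas above, deriving the rate features can be sketched like this (safe_rate is a hypothetical helper guarding against missing or zero denominators):

def safe_rate(numerator, denominator):
    # Return numerator/denominator, or "NaN" if either part is missing.
    if "NaN" in (numerator, denominator) or not denominator:
        return "NaN"
    return float(numerator) / float(denominator)

for person in data_dict.values():
    person["rate_poi_to_this_person"] = safe_rate(
        person["from_poi_to_this_person"], person["from_messages"])
    person["rate_this_person_to_poi"] = safe_rate(
        person["from_this_person_to_poi"], person["to_messages"])
    person["rate_shared_receipt_with_poi"] = safe_rate(
        person["shared_receipt_with_poi"], person["to_messages"])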

I did, however, leave the original features in the data set for the case of automatic feature selection with SelectKBest, and used this as a validation technique for the newly created features. I cannot assume a direct linear correlation between the original and the derived features; therefore, keeping both should not prove an obstacle to the algorithm.

One additional disclaimer should be added concerning the email features in general: since these features essentially make use of the knowledge of whether persons (including those assigned to the test set) are POIs, which is the very label we try to predict, we make ourselves guilty of “data leakage” or “test-set peeking” (see the relevant discussion in the Udacity forum or the associated thread on Quora).

How could this problem be avoided? The proper way would be to re-compute the email features during each train-test-split fold, separately for the training and test set. For the training set, one would re-compute the email features from the Enron corpus using only the subset of emails belonging to persons in the training set. For the test set, one could compute these values based on the whole data set.

For efficiency reasons, instead of running through the email corpus every time the email features are generated, one could build a graph-like data structure with …

  • vertices representing the persons
  • a set of directed edges being emails “sent” (or, reversely, “received”)
  • a set of directed edges being Ccs shared

This graph (or a sub-graph during the training process) may be re-used to calculate the email features on the fly during each train-test-split fold. The following image showcases the principle of this idea for the sent/received emails; an implementation, however, is out of scope for this project.

Avoiding data leakage by re-calculating email features from a (sub-)graph
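Although an implementation is out of scope for this project, a purely illustrative sketch of such a graph structure (all class and method names are hypothetical) could look like this:

from collections import defaultdict

class EmailGraph:
    # Directed edges for emails sent and Ccs shared, as listed above.

    def __init__(self):
        self.sent = defaultdict(list)  # sender -> [recipient, ...]
        self.ccd = defaultdict(list)   # sender -> [cc'd person, ...]

    def add_email(self, sender, recipients, ccs=()):
        self.sent[sender].extend(recipients)
        self.ccd[sender].extend(ccs)

    def rate_to_poi(self, person, poi_set):
        # Fraction of this person's sent emails addressed to a POI;
        # computed on a training sub-graph, this avoids the leakage above.
        sent = self.sent.get(person, [])
        return sum(r in poi_set for r in sent) / len(sent) if sent else 0.0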

 

2.4 Which were the feature scores for SelectKBest? How did you choose the SelectKBest parameters?

The image below shows the SelectKBest results for different classifiers. Every classifier used the same set of 10 automatically selected features (shown on the x-axis). The number (10 features) was chosen to achieve results comparable to those with the 10 manually selected features. Reducing the number of features further (by, e.g., rejecting more features with a high fraction of NA values) impacted performance negatively.

As can be seen from the following diagram, there is a high overlap between the features selected automatically by SelectKBest and the features selected by intuition after explorative analysis. Note that, although it was first selected manually during explorative data analysis, the feature loan_advances, having 97% NA values, is excluded from the final analysis.

Outcome of both automatic and manual feature selection process

 

Note also that if principal component analysis is employed, the whole feature set (excluding rejected features) is used, but it is then reduced to half the dimensionality of the selected feature set, that is, to 5 dimensions.
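For reference, the SelectKBest step described above can be sketched as follows (features, labels, and feature_names in the usual scikit-learn layout are assumed; f_classif is SelectKBest’s default scoring function):

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=10)
features_selected = selector.fit_transform(features, labels)

# Per-feature scores, as shown in the diagram above.
for name, score in sorted(zip(feature_names, selector.scores_),
                          key=lambda pair: -pair[1]):
    print(name, round(score, 2))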

3 Algorithms

3.1 What algorithm did you end up using? What other one(s) did you try?

I tried a variety of algorithms, all selectable with the -c <clf_id> option supplied to poi_id.py. Some of them are wrapped in an AdaBoostClassifier (DecisionTree, RandomForest); some are wrapped in GridSearchCV, which supports parameter tuning by cross-validation (see Tuning the Algorithm) (SVC, KNeighbors).

I tried the following algorithms (in that order):

  1. Gaussian Naive Bayes (-c 0): worked as a minimum viable product and set a “low bar” for other algorithms to cross
  2. AdaBoosted Decision Tree (-c 1): was quick to get running with acceptable results, but runs slowly
  3. Support Vector Classifier (-c 2): was very unstable and only worked with feature scaling; gave only mediocre results
  4. AdaBoosted RandomForest (-c 3): took very long and scored only mediocre performance; useless for our purposes
  5. KNeighbors (-c 4): quite good performance from the start; showed the highest performance increase of all when properly tuned with PCA
  6. LogisticRegression (-c 5): came in with quite mediocre performance from the start; no time was spent trying to improve it by tuning
  7. LDA (-c 6): mediocre performance; no time was spent trying to improve it by tuning

I finally chose the KNeighbors classifier (-c 4) due to its superior performance and runtime.

3.2 How did model performance differ between algorithms?

The following table lists all classifiers, separated by horizontal lines and ordered by their performance. For each classifier, different configurations are also displayed with the highest performing on top.

Performance is evaluated by the following criteria in descending priority:

  1. An algorithm having precision and recall greater than 0.3 is better than one which does not.
  2. The higher the F1 score, the better.
  3. The shorter the average runtime, the better.

Precision, recall and F1 score are evaluated with the supplied tester.py script. The average runtime denotes the average time it takes to run one train-test-split cross validation fold.

Showing only the highest performing configurations (pertaining to automatic feature selection and feature scaling / PCA), we can also reduce this table to compare the classifiers more directly. Notice how all algorithms except SVC and the AdaBoosted Decision Tree worked best without automatic feature selection and with PCA.

The following chart, finally, shows the precision and recall of each algorithm; notice the superiority of KNeighbors, followed by the AdaBoosted Decision Tree. We can also see the effect PCA has on KNeighbors with manual feature selection: using PCA, we get much higher precision but lower recall.

Looking at the average runtime of the algorithms, we can also observe that the better-performing algorithms, once properly tuned, take a while to complete, while the simpler algorithms finish rather quickly. “AdaBoosted RandomForest” is the odd element in this collection, taking very long and producing only mediocre results.

4 Tuning the Algorithm

4.1 What does it mean to tune the parameters of an algorithm?

Tuning the parameters of an algorithm means finding the parameter values which optimize its evaluation metric. Ideally, we would use one of the evaluation metrics we also use to evaluate the algorithm as a whole. For manual tuning, this can be the F1 score, which captures the sweet spot between precision and recall.

For automatic tuning using GridSearchCV, only “precision” is implemented as the scoring metric and was used wherever possible. GridSearchCV takes a list of candidate parameter values, tries every combination during training, and chooses the combination with the best performance.

Parameter tuning is done only once and should not be repeated during each training step of the algorithm. Therefore, an option switch is implemented in poi_id.py to manage the best GridSearchCV values: run poi_id.py with -x out to run an algorithm with GridSearchCV and write the optimal parameters to a file called poi_id_gridsearchcv.csv; run it with -x in to read the best parameters from that file (the default); or run it with -x off to deactivate this mechanism and use hard-coded default values for each algorithm.

4.2 What can happen if you don’t do this well?

If too few values are tested, we might not achieve the best performance. If too many values are tested, the classifier may end up fit too tightly to the training set (overfitting) and fail to predict useful results on the test set.

4.3 How did you tune the parameters of your algorithm?

Whether to use Principal Component Analysis, and with which parameter values, was tuned by hand for each classifier when building it for the final cross-evaluation. RandomizedPCA was executed with n_components=3.

For the algorithms supporting GridSearchCV, the following parameters were tuned:

4.3.1 KNeighbors (the algorithm of our choosing)

The following grid was tested:

{ 'algorithm': ['ball_tree', 'kd_tree', 'auto'],
  'weights': ['uniform','distance'],
  'n_neighbors': [1,2,3] }
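A hedged sketch of how such a grid could be searched (the scoring metric follows the “precision” choice described in the previous section; the cross-validation setting and variable names are assumptions, not copied from poi_id.py):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    "algorithm": ["ball_tree", "kd_tree", "auto"],
    "weights": ["uniform", "distance"],
    "n_neighbors": [1, 2, 3],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      scoring="precision", cv=5)
search.fit(features_train, labels_train)
print(search.best_params_)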

The best parameters were for the various configurations:

Feature Scaling   Feature Selection   n_neighbors   weights   algorithm
off/none          automatic           3             uniform   ball_tree
off/none          manual              1             uniform   ball_tree
on/MinMax         automatic           1             uniform   ball_tree
on/MinMax         manual              1             uniform   ball_tree
pca               automatic           3             uniform   ball_tree
pca               manual              3             uniform   ball_tree

4.3.2 Support Vector Classifier

The SVC could only be computed when feature scaling was enabled. The search grid was:

{ 'C': [1e3, 1e4, 1e5],
  'gamma': [0.0001, 0.001, 0.01, 0.1],
  'kernel': ['rbf', 'poly'] } 

The best parameters for the various configurations were:

Feature Scaling   Feature Selection   kernel   C          gamma
on/MinMax         automatic           poly     100000.0   0.1
on/MinMax         manual              poly     100000.0   0.01

4.3.3 Other Algorithms

For the other algorithms (AdaBoosted RandomForest, Decision Tree), values were tuned manually.

5 Validation Strategy

5.1 What is validation?

Validation means excluding some of the data from the training set and using it, after fitting the classifier, to assess the classifier’s performance. While the training set is used to train / fit the classifier on a set of given features and labels, the features in the test set are used to run the algorithm and predict labels. These predicted labels are then compared to the actual labels of the test set. Different evaluation metrics exist, depending on how exactly we want to define performance.

5.2 What’s a classic mistake you can make if you do it wrong?

Naively, one could use the same data for training and testing, or just use the whole data set for training and not test at all. However, that way one would certainly overfit the classifier to the particular training set. The classifier would be too biased to make predictions that generalize to new data. We prevent that by using a test set, which provides such new data and therefore a more honest evaluation.

5.3 How did you validate your analysis?

I re-implemented the StratifiedShuffleSplit cross-validation also found in tester.py. This method divides the dataset into train and test sets multiple times (in the tester: 1000 times), and each fold sets aside 10 percent of the data as the test set. Since only 12.5% of the data points are POIs, this resulted in a few runs with no POI samples in the test set or no samples predicted as POI. During training on one such train-test-split fold, this threw an error, since precision, the metric that was optimized for, cannot be computed when no samples are predicted to be POI.

By choosing multiple train-test-set splits and averaging the performance over all predictions, we can smooth out potential biases which might, by chance, lie in a particular train-test split. Overall, we are also using our whole dataset more economically.
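The validation loop can be sketched as follows (modern scikit-learn API; aggregating the confusion-matrix counts over all folds follows the same idea as tester.py, and clf, features, and labels are assumed to exist):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X, y = np.asarray(features), np.asarray(labels)
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.1, random_state=42)

tp = fp = fn = 0
for train_idx, test_idx in sss.split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    # Note: with rare POIs, a single fold can yield zero predicted
    # positives (see above); aggregating counts over folds avoids this.
    tp += int(np.sum((pred == 1) & (y[test_idx] == 1)))
    fp += int(np.sum((pred == 1) & (y[test_idx] == 0)))
    fn += int(np.sum((pred == 0) & (y[test_idx] == 1)))

precision = tp / (tp + fp)
recall = tp / (tp + fn)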

The script poi_id.py supports a command line parameter -t <train_test_set_split_folds> which defines how many train and test set splits are performed (default = 10, use 1000 to emulate tester.py and 1 to perform only one train-test-split).

6 Evaluation Metrics

6.1 What are the (\(\ge 2\)) evaluation metrics you chose?

I chose the following metrics:

  • \(\text{Precision} = \frac{\text{true_positives}}{\text{true_positives} + \text{false_positives}}\)
    Of all predictions labelled “positive”, the fraction that truly should be labelled “positive” (“true positives”).
  • \(\text{Recall} = \frac{\text{true_positives}}{\text{true_positives} + \text{false_negatives}}\)
    Of all observations that should be labelled “positive”, the fraction actually predicted as “positive” (“true positives”).
  • \(\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)
    The harmonic mean of Precision and Recall, taking both measures into account and acknowledging the trade-off between the two.
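For completeness, the same metrics can be computed with scikit-learn on a held-out test fold (clf, features_test, and labels_test are assumed):

from sklearn.metrics import precision_score, recall_score, f1_score

pred = clf.predict(features_test)
print(precision_score(labels_test, pred),
      recall_score(labels_test, pred),
      f1_score(labels_test, pred))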

See below (question “What would that mean in laymen terms…”) for an explanation of these metrics in terms related to the project.

6.2 What is your average performance on those metrics?

The performance of our algorithm of choice (KNeighbors) in our preferred configuration (PCA without feature scaling, manual feature selection) is as follows (as already shown in section Algorithms).

\(\text{F1} = 0.487, \text{Precision} = 0.666, \text{Recall} = 0.384\)

One can see that the algorithm is heavily focused on precision, since GridSearchCV optimizes for this evaluation metric.

6.3 What would that mean in laymen terms about the algorithm’s performance?

For our algorithm, this result means that the chance of a predicted POI indeed being a person of interest is 66.6%. Conversely, only 38.4% of the actual persons of interest are identified as such. In this case, one might legitimately argue that one should indeed optimize for the first percentage (precision) in order to keep the chance of erroneously flagging someone as a POI low.

7 Reflection

This project was a fun learning experience and taught me that there are many ready-made tools in scikit-learn and that getting them to run is, in fact, very easy. I also used it as a chance to deepen my R and Python knowledge. There are a few things I would do differently were I to start over or revisit this project later.

8 Appendix: System Design

This section should provide the interested reader with some insights about the design of the code that I wrote and its enhanced capabilities. It quickly became clear to me that I would need to add some more functionality to the poi_id.py script than just the code to create one classifier, run it and perform cross-validation.

I wanted to follow a more structured approach, with tools to re-run previous classifiers or GridSearchCV for different configurations, explore the data set both as a spreadsheet and visually, run and evaluate multiple classifiers in a batch sequence with the goal of comparing metrics, and finally visualize the metrics, preferably with some nice ggplot2 graphs in R. The goal was mostly not to lose track of any of the configurations I had previously tried, and to be able to revisit them with other settings very easily at a later point.

I ended up enhancing the poi_id.py script to accept different command line options to show GUIs for data exploration, change the cross-validation size or configuration, switch to different (previously tried) classifiers, or modify the pipeline configuration, like automatic scaling, feature selection, or PCA. Using no command line options whatsoever, the script still fulfills the functionality specified by the Data Analyst Nanodegree project rubric: creating and dumping my preferred classifier pipeline, dataset, and feature list.

Along with poi_id.py there is also a helper class to show a Qt-based GUI and a wrapper script poi_id_batch.py to record the metrics of various pipelines, all run in sequence. Minor tweaks, like varying the number of automatically selected features, can have a big impact on all classifiers I tried previously. To make an honest performance assessment, I quite frequently had to re-run all 6 classifiers in different configurations with minor tweaks, an endeavour which took about 6 hours on my machine for each batch run and could easily be automated and done overnight.

The following figure shows the actors and components of the target system and the interactions between them. The color encodes the implemented use case. All scripts and files are in the final_project directory, except for the *.Rmd and *.html files, which are in the project root directory.

The system architecture with grader and data analyst as actors

 

The code is thoroughly documented, and all runnable Python scripts support -h to show the possible command line options. The idea is that this code could serve me or others when evaluating different machine learning pipelines in later projects. If nothing else, it was also a fun learning experience and deepened my knowledge of R, Python, and PyQt. :)

9 References