
Decision Tree

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression tasks. They are a powerful tool for creating predictive models due to their simplicity, interpretability, and versatility.

How They Work:

A Decision Tree splits the data into subsets based on the value of input features. It uses a tree structure where each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label (in classification) or a continuous value (in regression). The paths from root to leaf represent classification rules.
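The structure described above can be sketched in a few lines with scikit-learn. The toy data and feature meanings here are illustrative, not the project's actual dataset:

```python
# Minimal sketch: fitting a decision tree classifier with scikit-learn.
# Toy data: features are [temperature, rainfall_mm]; target 1 = play, 0 = don't.
from sklearn.tree import DecisionTreeClassifier

X = [[30, 0], [22, 5], [18, 20], [25, 2], [15, 30], [28, 1]]
y = [1, 1, 0, 1, 0, 1]

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# Walk a new sample from root to leaf to get a class label.
print(clf.predict([[27, 0]]))
```

Each internal node of the fitted tree tests one feature against a threshold, and the predicted label is read off the leaf the sample lands in.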

Advantages:

  • Interpretability: The model can be visualized and understood by non-experts.

  • No Need for Feature Scaling: They don’t require normalization of data.

  • Handle Non-linear Relationships: They can model complex, non-linear relationships.

 

Challenges:

  • Overfitting: Without proper tuning, DTs can create overly complex trees that do not generalize well beyond the training data.

  • Instability: Small variations in data can result in a different tree being generated.

  • Biased Trees: Impurity-based splitting favors features with many distinct levels, and trees can also be biased when some classes dominate the dataset.


Gini Impurity, Entropy, and Information Gain

Gini Impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. The Gini impurity of a dataset is minimized if the dataset is pure (all elements are of the same class).

 

The Gini impurity for a set of items with $J$ classes is calculated as:

$$G = \sum_{j=1}^{J} p_j\,(1 - p_j) = 1 - \sum_{j=1}^{J} p_j^{2}$$

where $p_j$ is the fraction of items belonging to class $j$.
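With $p_j$ the proportion of each class, the Gini impurity can be computed directly; a minimal sketch:

```python
# Gini impurity from class proportions: G = 1 - sum(p_j^2).
def gini(proportions):
    return 1.0 - sum(p * p for p in proportions)

print(gini([1.0]))        # pure node: impurity 0.0
print(gini([0.5, 0.5]))   # maximally mixed two-class node: 0.5
```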

Entropy is a measure of the randomness or disorder within a set of classes. Higher entropy means more disorder or uncertainty. In the context of Decision Trees, a lower entropy is better because it indicates a higher level of certainty or purity.

 

The entropy for a set with $J$ classes is given by:

$$H = -\sum_{j=1}^{J} p_j \log_2 p_j$$

where $p_j$ is again the fraction of items belonging to class $j$.
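The entropy formula is equally short in code. As a sanity check, a 9-to-5 class split gives the ≈0.94 value quoted in the example further down:

```python
import math

# Shannon entropy in bits: H = -sum(p_j * log2 p_j), treating 0*log(0) as 0.
def entropy(proportions):
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([1.0]))          # pure set: 0 bits of uncertainty
print(entropy([0.5, 0.5]))     # 50/50 split: maximum of 1 bit
print(entropy([9/14, 5/14]))   # ~0.94 bits
```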

Information Gain measures the change in entropy or impurity from before to after the set is split on an attribute. It's the difference in impurity measure (either Gini or Entropy) of the original set and the weighted sum of impurity of each subset resulting from the split. The goal in Decision Trees is to maximize information gain — essentially, how much uncertainty in the data was reduced after the split.

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$

where $S$ is the parent set, $A$ is the attribute being split on, and $S_v$ is the subset of $S$ taking value $v$.
Example

Imagine we have a dataset with two features: Weather (Sunny, Overcast, Rainy) and Temperature (Hot, Mild, Cool), and we want to predict if we will play a sport.

 

Let's calculate information gain for a split on the Weather feature, assuming the entire dataset has an entropy (disorder) of 0.94 and the subsets after splitting by Weather are as follows:


  • Sunny: 5 Yes, 3 No (entropy ≈ 0.954)

  • Overcast: 4 Yes, 0 No (entropy 0)

  • Rainy: 3 Yes, 2 No (entropy ≈ 0.971)


The information gain for splitting on Weather would be:

$$IG(\text{Weather}) = 0.94 - \left[\frac{8}{17}(0.954) + \frac{4}{17}(0) + \frac{5}{17}(0.971)\right] \approx 0.205$$
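The worked example can be verified numerically, taking the subset counts and the parent entropy of 0.94 as given in the text above:

```python
import math

# Entropy of a node from raw class counts (e.g. [yes_count, no_count]).
def entropy(counts):
    total = sum(counts)
    return -sum(c/total * math.log2(c/total) for c in counts if c > 0)

parent_entropy = 0.94  # given for the whole dataset
subsets = {"Sunny": [5, 3], "Overcast": [4, 0], "Rainy": [3, 2]}

n = sum(sum(c) for c in subsets.values())  # 17 samples across the subsets
weighted = sum(sum(c)/n * entropy(c) for c in subsets.values())
gain = parent_entropy - weighted
print(round(weighted, 3), round(gain, 3))  # ~0.735 weighted entropy, ~0.205 gain
```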

Calculating this gives an information gain of roughly 0.21, which tells us how "good" the split is in terms of reducing uncertainty; higher gain means a more informative split.

Infinite Number of Trees:

It's generally possible to create an infinite number of Decision Trees for a given dataset due to the following reasons:

 

  • Feature Selection: There are many ways to select which features to split on at each step of the tree-building process.

  • Splitting Points: For continuous variables, there are infinitely many points at which the data can be split.

  • Tree Depth: A Decision Tree can continue to grow indefinitely, splitting on different features or values, unless constraints such as maximum depth or minimum samples per node are set.

 

These factors contribute to the potentially infinite ensemble of Decision Trees that could be generated from a dataset. Each tree might capture different aspects of the data or reflect different assumptions about its structure. However, in practice, methods like pruning, setting maximum depth, and requiring a minimum number of samples per leaf are used to limit the size of the tree and combat overfitting.
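The growth-limiting constraints mentioned above map directly onto `DecisionTreeClassifier` parameters. A minimal sketch on synthetic data (the specific values here are illustrative, not tuned):

```python
# Constraining tree growth to combat overfitting.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = DecisionTreeClassifier(
    max_depth=4,           # cap the number of successive splits
    min_samples_leaf=5,    # require at least 5 samples in every leaf
    min_samples_split=10,  # don't split nodes with fewer than 10 samples
    random_state=0,
)
clf.fit(X, y)
print(clf.get_depth())  # never exceeds max_depth
```

Cost-complexity pruning (`ccp_alpha`) is another standard option for shrinking an already-grown tree.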

Data Prep

To forecast national park visitation with Decision Trees (DTs), we gathered a diverse set of data including historical visit counts, weather conditions, and park characteristics. The data underwent cleaning to correct anomalies and fill missing values, ensuring quality and consistency. We chose features with potential predictive power, such as past visitation trends and environmental factors, while discarding irrelevant ones to streamline the model.


We then transformed categorical variables into numerical formats suitable for machine learning algorithms through encoding techniques. The dataset was split into training and testing sets, traditionally allocating 70-80% for training and the remainder for testing, to validate the model's predictive capabilities on unseen data.
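The encoding and splitting steps might look like the following sketch. The column names and toy values are hypothetical stand-ins for the project's actual schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame standing in for the real visitation dataset.
df = pd.DataFrame({
    "weather": ["Sunny", "Overcast", "Rainy", "Sunny", "Rainy", "Overcast"],
    "avg_temp": [30, 22, 18, 28, 16, 21],
    "visitor_increase": [1, 1, 0, 1, 0, 1],
})

# One-hot encode the categorical feature into numeric columns.
X = pd.get_dummies(df[["weather", "avg_temp"]], columns=["weather"])
y = df["visitor_increase"]

# 80/20 split; a fixed random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```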


Using the training set, we trained a DT model, fine-tuning parameters to balance complexity with generalization. We measured the model's accuracy, precision, and recall on the test set to gauge its performance, focusing on how well it distinguished between increased and stable visitation patterns.

How Training and Testing Sets are Created

 

The dataset is randomly split into two parts:

 

Training Set: This subset of the data is used to train the machine learning model. It includes both the input features (e.g., weather conditions, past visitation numbers) and the corresponding target values (e.g., visitor_pct_change).

 

Testing Set: This subset is used to test or evaluate the model’s performance. It is kept separate from the training data and is not seen by the model during the training phase. Like the training set, it includes both the input features and the target values.

Raw Data


Training Data


Testing Data


Results and Conclusions

For the project on predicting national park visitation, we've trained three different Decision Tree (DT) models. These models were built using a dataset that included features like 2020 visitation data, average ratings, difficulty ratings, and environmental factors such as temperature and precipitation.

 

Result Interpretation:

Each tree was trained on a different subset of the data to ensure diverse learning and to mitigate overfitting. This approach provided a broader perspective on the model's predictions and accounted for potential variability in the data.

 

Data Splitting:

The original dataset was divided into disjoint training and testing sets. This split is critical because it prevents information leakage from the training data to the model evaluation phase, ensuring that the performance metrics we observe are indicative of how well the model would perform on unseen data. Typically, a 70-30 or 80-20 split is used, ensuring enough data for training while retaining a significant portion for unbiased testing.

 

Confusion Matrix and Accuracy:

A confusion matrix was used to visualize the model's performance, clearly showing the true positives, true negatives, false positives, and false negatives. This matrix was crucial for calculating the accuracy of the model, as well as other metrics like precision and recall, which inform us about the model's reliability.
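These quantities follow directly from a confusion matrix; a small sketch with hypothetical labels (not the project's actual predictions):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (1 = visitation increase).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)
print(accuracy_score(y_true, y_pred))  # (TP + TN) / total
```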

 

Decision Trees Visualization:

The three DTs were visualized to interpret the decisions and the splits made by the models. Each node in the tree represents a decision based on a feature, leading to a prediction at the leaf nodes. These visualizations help us understand the decision-making process of the model and which features contribute most to predicting changes in visitation.
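One way to produce such a rendering is scikit-learn's `export_text` (or `plot_tree` for a graphical version). The sketch below uses the built-in iris dataset rather than the park data:

```python
# Render a fitted tree as text: one line per decision node and leaf.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

print(export_text(clf, feature_names=list(iris.feature_names)))
```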

Decision Tree 1

Decision Tree 2

Decision Tree 3

The Decision Tree model's metrics give us insight into its performance in predicting national park visitation changes:

  • Accuracy (0.6250 or 62.5%): This means that the model correctly predicted whether there was a visitation increase or not 62.5% of the time across the dataset. It's a general measure of performance that does not give detail on the type of errors (false positives or false negatives).

  • Precision (0.7500 or 75%): Precision tells us that when the model predicts an increase in visitation, it is correct 75% of the time. High precision suggests that there are relatively few false positives; when the model predicts an increase, it is likely to be right, which is useful if the cost of a false positive is high (e.g., unnecessary resource allocation for expected visitors).

  • Recall (0.3750 or 37.5%): Recall is a measure of the model's ability to find all the relevant cases within a dataset. A recall of 37.5% means the model identifies 37.5% of all actual visitation increases. This is fairly low, implying that the model misses a significant number of actual increases (false negatives). This could be problematic if it's crucial to capture as many increases as possible, such as for preparing necessary facilities and staff to manage the influx of visitors.

Given these performance indicators, our Decision Tree model is conservative in predicting an increase in visitation: it makes fewer mistakes when it predicts an increase, but at the cost of missing several actual increases.
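As a consistency check, one hypothetical confusion matrix that reproduces all three reported metrics (assuming a test set of 16 samples; the counts are reconstructed, not taken from the project's output) is TP = 3, FP = 1, FN = 5, TN = 7:

```python
# Hypothetical confusion-matrix counts consistent with the reported metrics.
tp, fp, fn, tn = 3, 1, 5, 7

accuracy = (tp + tn) / (tp + fp + fn + tn)   # fraction of correct predictions
precision = tp / (tp + fp)                   # correctness of "increase" calls
recall = tp / (tp + fn)                      # fraction of real increases found
print(accuracy, precision, recall)  # 0.625 0.75 0.375
```

The low recall shows up directly in the counts: 5 of the 8 actual increases are missed.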


Learnings and Future Projections:

Through this project, we learned about the significance of feature selection and data partitioning in training predictive models. The results from the DTs can guide future strategies for park management, such as staffing needs and resource allocation during peak times. For future projections, the models can be updated with new data to refine predictions and account for changing patterns in park visitation.

 

Conclusion:

The ensemble of DT models offers a robust tool for predicting park visitation changes, capturing complex nonlinear relationships within the data. Despite the variability in individual tree predictions, when combined, they can provide a more reliable forecast. The insights gained demonstrate the model's utility in operational planning and resource management for national parks.
