Hazard Extraction and Analysis of Trends (HEAT): ICS-209-PLUS

This notebook provides a demonstration of the hazard extraction and analysis of trends (HEAT) framework applied to the ICS-209-PLUS dataset, available at https://data.nal.usda.gov/dataset/data-all-hazards-dataset-mined-us-national-incident-management-system-1999%E2%80%932014

This example uses the trend analysis module from MIKA’s knowledge discovery toolkit, as well as the Data and ICS utilities.

Before running the analysis in this notebook, hazards must be extracted with BERTopic (via topic model plus) using the ICS_hazard_extraction_script.

Imports

[1]:

import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.style
matplotlib.style.use("seaborn-v0_8")
import matplotlib.pyplot as plt
import matplotlib.cm as cm
plt.rcParams["font.family"] = "Times New Roman"
import seaborn as sn
sn.color_palette("hls", 17)
import scipy.stats as st
import math
import dill
from pingouin import rcorr

[2]:

import sys
import os
sys.path.append(os.path.join("..", "..", "..", ".."))

from mika.kd.trend_analysis import *
from mika.utils.ICS import *
from mika.utils import Data

For figure consistency, global configuration variables are defined here.

[3]:

figsize = (6, 4)
fontsize = 14
matrix_figsize = (8,9)
matrix_fontsize = 10

Data Import

The preprocessed ICS-209-PLUS dataset is loaded using the Data utility:

[4]:

document_id_col = "Unique IDs"
extra_cols = ["CY","DISCOVERY_DATE", "START_YEAR", "REPORT_DOY", "DISCOVERY_DOY",
              "TOTAL_PERSONNEL", "TOTAL_AERIAL", "PCT_CONTAINED_COMPLETED"]
list_of_attributes = ["Combined Text"]
file = os.path.join('topic_model_results', 'preprocessed_data_combined_text.csv')
ICS = Data()
ICS.load(file, preprocessed=True, id_col=document_id_col, text_columns=["Combined Text"], preprocessed_kwargs={'drop_short_docs':False, 'drop_duplicates':True})

Now the preprocessed dataframe that will be used is defined:

[5]:

preprocessed_df = ICS.data_df

The dataframe is then filtered to ensure it includes the correct years and incidents.

[6]:

incident_file = os.path.join(os.path.abspath(os.path.join(os.getcwd(), os.pardir, os.pardir, os.pardir, os.pardir)),'data','ICS','summary_reports_cleaned.csv')
incident_summary_df = pd.read_csv(incident_file,low_memory=False)
incident_summary_df = incident_summary_df.drop("Unnamed: 0", axis=1)
incident_summary_df = incident_summary_df.loc[incident_summary_df["START_YEAR"]>=2006].reset_index(drop=True)

fire_ids = incident_summary_df['INCIDENT_ID'].unique()
sitrep_ids = preprocessed_df['INCIDENT_ID'].unique()
incident_summary_df = incident_summary_df[incident_summary_df['INCIDENT_ID'].isin(sitrep_ids)].reset_index(drop=True)

Hazard Extraction

First, hazards are identified in documents using the BERTopic modeling results and a hazard interpretation spreadsheet created from the topics.

The hazard file and topic modeling results file are specified here:

[7]:

hazard_file =  os.path.join('topic_model_results','hazard_interpretation_v1.xlsx')
results_file = os.path.join('topic_model_results',"Combined Text Sentences_BERT_topics_modified.csv")

[8]:

hazard_interpretation_df = pd.read_excel(hazard_file, sheet_name='topic-focused')
categories = hazard_interpretation_df['Hazard Category'].tolist()
hazards = hazard_interpretation_df['Hazard name'].tolist()

The hazard and results files are passed into the identify_docs_per_hazard function, which returns the frequency and documents associated with each hazard, in addition to the hazard words and topics for each document. Hazard metrics specific to the ICS are calculated using calc_ICS_metrics:

[9]:

# frequency, docs_per_hazard, hazard_words_per_doc, topics_per_doc, hazard_topics_per_doc = identify_docs_per_hazard(hazard_file, preprocessed_df, results_file, text_field='Combined Text', results_text_field='Combined Text Sentences_BERT_to', time_field="CY", id_field='Unique IDs', doc_topic_dist_field=None, topic_thresh=0.0)
# time_of_occurence_days, time_of_occurence_pct_contained, frequency, fires, frequency_fires = calc_ICS_metrics(docs_per_hazard, preprocessed_df, id_col="INCIDENT_ID", unique_ids_col='Unique IDs', rm_outliers=False)

Results can be saved for future use with dill, which extends pickle.

[10]:

# with open("OTTO_days_w_outliers.pkl", "wb") as f:
#     dill.dump(time_of_occurence_days, f)
# with open("OTTO_pct_w_outliers.pkl", "wb") as f:
#     dill.dump(time_of_occurence_pct_contained, f)
# with open("frequency_w_outliers.pkl", "wb") as f:
#     dill.dump(frequency, f)
# with open("fires_w_outliers.pkl", "wb") as f:
#     dill.dump(fires, f)
# with open("frequency_fires_w_outliers.pkl", "wb") as f:
#     dill.dump(frequency_fires, f)
# with open("docs_per_hazard_w_outliers.pkl", "wb") as f:
#     dill.dump(docs_per_hazard, f)
# with open("hazard_words_per_doc_w_outliers.pkl", "wb") as f:
#     dill.dump(hazard_words_per_doc, f)
# with open("topics_per_doc_w_outliers.pkl", "wb") as f:
#     dill.dump(topics_per_doc, f)
# with open("hazard_topics_per_doc_w_outliers.pkl", "wb") as f:
#     dill.dump(hazard_topics_per_doc, f)

Since the results have already been saved, we can load them with dill:

[11]:

with open("OTTO_days_w_outliers.pkl", "rb") as f:
    time_of_occurence_days = dill.load(f)
with open("OTTO_pct_w_outliers.pkl", "rb") as f:
    time_of_occurence_pct_contained = dill.load(f)
with open("frequency_w_outliers.pkl", "rb") as f:
    frequency = dill.load(f)
with open("fires_w_outliers.pkl", "rb") as f:
    fires = dill.load(f)
with open("frequency_fires_w_outliers.pkl", "rb") as f:
    frequency_fires = dill.load(f)
with open("docs_per_hazard_w_outliers.pkl", "rb") as f:
    docs_per_hazard = dill.load(f)
with open("hazard_words_per_doc_w_outliers.pkl", "rb") as f:
    hazard_words_per_doc = dill.load(f)
with open("topics_per_doc_w_outliers.pkl", "rb") as f:
    topics_per_doc = dill.load(f)
with open("hazard_topics_per_doc_w_outliers.pkl", "rb") as f:
    hazard_topics_per_doc = dill.load(f)
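
The load blocks above all follow one pattern, so a loop over names can express it more compactly. The sketch below is an illustration and not part of MIKA: it uses the standard-library pickle module (dill exposes the same dump and load calls), and the helper names save_results and load_results, the file suffix argument, and the toy data are assumptions.

```python
import os
import pickle
import tempfile


def save_results(results, directory, suffix="_w_outliers.pkl"):
    """Write each entry of `results` to <directory>/<name><suffix>."""
    for name, obj in results.items():
        with open(os.path.join(directory, name + suffix), "wb") as f:
            pickle.dump(obj, f)


def load_results(names, directory, suffix="_w_outliers.pkl"):
    """Read the named objects back into a dict keyed by name."""
    loaded = {}
    for name in names:
        with open(os.path.join(directory, name + suffix), "rb") as f:
            loaded[name] = pickle.load(f)
    return loaded


# Toy stand-ins for the real result dictionaries (hypothetical values).
toy_results = {"frequency": {"Wind": {2006: 3}}, "fires": {"Wind": ["A", "B"]}}
with tempfile.TemporaryDirectory() as tmp_dir:
    save_results(toy_results, tmp_dir)
    restored = load_results(toy_results.keys(), tmp_dir)
```

Keeping the names in one place makes it harder for the save and load cells to drift apart.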

To evaluate the quality of the hazard extraction, we recommend randomly sampling 1000 documents to manually label as containing or not containing each hazard. This can be done using the sample_for_accuracy function. The 1000-document set is split into two 500-document sets for validation and testing.
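
A minimal sketch of the sampling and split described above, using plain pandas rather than sample_for_accuracy (whose signature is not shown here). The toy dataframe, the seed, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the preprocessed dataframe (assumption: one row per document).
docs = pd.DataFrame({"Unique IDs": [f"doc_{i}" for i in range(5000)]})

# Draw 1000 documents without replacement; a fixed seed keeps the sample reproducible.
sample = docs.sample(n=1000, random_state=0)

# Split the shuffled sample into two disjoint 500-document sets.
validation_set = sample.iloc[:500]
test_set = sample.iloc[500:]
```

Because the sample is drawn without replacement, the two halves cannot share a document.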

[12]:

results_path=os.path.join('topic_model_results')

First we calculate the classification metrics on the validation set:

[13]:

metrics, true, pred = calc_classification_metrics(os.path.join('topic_model_results', 'labeled_ICS.csv'), docs_per_hazard=docs_per_hazard, id_col='Unique IDs')

Next we calculate the classification metrics on the test set:

[14]:

test_metrics, _, _ = calc_classification_metrics(os.path.join('topic_model_results', 'labeled_ICS_test_set_full.csv'), docs_per_hazard=docs_per_hazard, id_col='Unique IDs')

Mismatches between HEAT and manual labels are detected and saved to an Excel file:

[15]:

_ = examine_hazard_extraction_mismatches(preprocessed_df, true, pred, hazards, hazard_words_per_doc=hazard_words_per_doc, topics_per_doc=topics_per_doc, hazard_topics_per_doc=hazard_topics_per_doc, results_path=results_path, id_col='Unique IDs', text_col='Combined Text')

To display the results tables consistently, we first create the primary results table, then sort the hazard extraction evaluation table according to the order of the hazards in the primary table:

[16]:

years = preprocessed_df["CY"].unique()
years.sort() #sort years
table_data = create_primary_results_table(time_of_occurence_days, time_of_occurence_pct_contained, frequency, frequency_fires, preprocessed_df, categories, hazards, years, interval=True) #primary results table with metrics
table = pd.DataFrame(table_data).sort_values('Hazard Category').reset_index(drop=True) #sort by category
hazards_sorted = table['Hazard Name'].tolist() #get sorted hazards for formatting evaluation table

Now we can display the hazard extraction evaluation table:

High precision: documents flagged for a hazard almost always contain it, so the hazard is not overcounted.

Low recall: some documents that contain the hazard are missed, so the hazard is undercounted.

[17]:

hazard_extraction = pd.concat([metrics,test_metrics],axis=1,keys=['Validation','Test']).reindex(hazards_sorted)
hazard_extraction

[17]:

	Validation					Test
	Recall	Precision	F1	Accuracy	Support	Recall	Precision	F1	Accuracy	Support
Hazardous Terrain	0.883	0.938	0.910	0.940	171	0.859	0.907	0.882	0.912	192
Ecological Resources	0.621	0.900	0.735	0.948	58	0.667	0.881	0.759	0.934	78
Thunderstorms	0.875	1.000	0.933	0.992	32	0.857	0.818	0.837	0.986	21
Wind	0.840	0.988	0.908	0.968	94	0.795	0.912	0.849	0.956	78
Dry Weather	0.791	0.973	0.873	0.958	91	0.732	0.938	0.822	0.948	82
Rain	0.872	0.872	0.872	0.980	39	0.800	0.800	0.800	0.976	30
Smoke	0.943	0.847	0.893	0.976	53	0.929	0.796	0.857	0.974	42
Evacuations	0.885	0.958	0.920	0.984	52	0.909	0.930	0.920	0.986	44
Injury	0.786	1.000	0.880	0.994	14	0.700	1.000	0.824	0.994	10
Resource Shortage	0.776	0.826	0.800	0.962	49	0.778	0.651	0.709	0.954	36
Road Closures	0.792	0.704	0.745	0.948	48	0.760	0.613	0.679	0.928	50
Command Transition	0.902	0.965	0.932	0.984	61	0.758	0.833	0.794	0.948	66
Inaccurate Mapping	0.824	0.933	0.875	0.992	17	0.913	0.840	0.875	0.988	23
Aerial Grounding	1.000	0.765	0.867	0.992	13	0.625	0.833	0.714	0.992	8
Military Base	0.667	0.857	0.750	0.992	9	0.800	0.667	0.727	0.994	5
Cultural Resources	0.810	0.810	0.810	0.956	58	0.729	0.827	0.775	0.950	59
Law Violations	1.000	1.000	1.000	1.000	3	1.000	0.667	0.800	0.998	2
Infrastructure	0.714	0.921	0.805	0.966	49	0.725	0.829	0.773	0.966	40
Livestock	0.842	0.842	0.842	0.988	19	0.765	0.619	0.684	0.976	17
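
To make the column definitions concrete, the sketch below computes recall, precision, F1, accuracy, and support for one hazard from binary labels. It is not the implementation of calc_classification_metrics; the function name binary_metrics and the toy labels are assumptions.

```python
def binary_metrics(true, pred):
    """Return recall, precision, F1, accuracy, and support for 0/1 labels."""
    pairs = list(zip(true, pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # hazard present and flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # flagged but absent
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # present but missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # absent and not flagged

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"Recall": recall, "Precision": precision, "F1": f1,
            "Accuracy": accuracy, "Support": tp + fn}


# Hypothetical manual labels (true) and extracted labels (pred) for ten documents.
true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
result = binary_metrics(true, pred)
```

Support here is the number of manually labeled documents that contain the hazard, which is why rare hazards such as Law Violations show less stable scores.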

To visualize what words define the hazards, we create word clouds based on word frequency using the trend analysis module.

[19]:

word_frequencies = get_word_frequencies(hazard_words_per_doc, hazards_sorted)

[20]:

build_word_clouds(word_frequencies, nrows=4, ncols=5, figsize=(16, 8), cmap=None, save=False, save_path=os.path.join('topic_model_results',''), fontsize=20)

Primary Analysis:

The primary analysis involves two main outputs:

- hazard metrics
- risk matrix

Hazard Metrics

Hazard metrics, including frequency, rate, and severity, are displayed in a table.

First, we reorder the dictionary containing fires for each hazard according to the sorted hazard list:

[21]:

fires = {hazard: fires[hazard] for hazard in hazards_sorted}

Next, we calculate severity and sort it according to the sorted hazard list:

[22]:

severity_total, severity_table = calc_severity(fires, incident_summary_df ,rm_all_outliers=False, rm_severity_outliers=False)

[23]:

severity_table = severity_table.set_index('Hazard').reindex(hazards_sorted).reset_index()

For comparison, we also calculate severity values across the entire dataset. This allows us to compare the average severity for incidents with certain hazards to a baseline value.

[24]:

severity_accross_all_incidents = []; injuries_all = []; fatalities_all = []; str_dam_all = []; str_des_all = []
for i in range(len(incident_summary_df)):
    severity = int(incident_summary_df.iloc[i]["STR_DESTROYED_TOTAL"]) + int(incident_summary_df.iloc[i]["STR_DAMAGED_TOTAL"])+ int(incident_summary_df.iloc[i]["INJURIES_TOTAL"])+ int(incident_summary_df.iloc[i]["FATALITIES"])
    severity_accross_all_incidents.append(severity)
    injuries_all.append(int(incident_summary_df.iloc[i]["INJURIES_TOTAL"])); fatalities_all.append(int(incident_summary_df.iloc[i]["FATALITIES"]))
    str_dam_all.append(int(incident_summary_df.iloc[i]["STR_DAMAGED_TOTAL"])); str_des_all.append(int(incident_summary_df.iloc[i]["STR_DESTROYED_TOTAL"]))
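
The loop above sums four columns row by row. An equivalent vectorized form is sketched below on a toy dataframe; the values are invented for illustration, and only the column names come from the notebook.

```python
import pandas as pd

severity_cols = ["STR_DESTROYED_TOTAL", "STR_DAMAGED_TOTAL", "INJURIES_TOTAL", "FATALITIES"]

# Hypothetical incident rows standing in for incident_summary_df.
toy_incidents = pd.DataFrame({
    "STR_DESTROYED_TOTAL": [0, 12, 3],
    "STR_DAMAGED_TOTAL": [1, 4, 0],
    "INJURIES_TOTAL": [2, 0, 5],
    "FATALITIES": [0, 1, 0],
})

# Cast to int as the loop does, then sum across the four columns for each incident.
severity_per_incident = toy_incidents[severity_cols].astype(int).sum(axis=1).tolist()
```

On a frame with thousands of incidents, the column-wise sum avoids repeated iloc lookups.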

Now we save the total dataset information as a single-row dataframe so we can add it to the existing hazard metric table.

[25]:

total_incidents_df = pd.DataFrame({"Hazard Category": ['Total Reports'],
                                  "Hazard Name": [''],
                                  "OTTO %":[''],
                                   "Total Fire Frequency":[str(len(incident_summary_df))],
                                  "Rate":[str(round(np.average(incident_summary_df['START_YEAR'].value_counts().values),1))+"+-"+str(round(np.std(incident_summary_df['START_YEAR'].value_counts().values),1))],#len(incident_summary_df)/len(years))],
                                  "Fatalities":[str(round(np.average(fatalities_all),1))+"+-"+str(round(np.std(fatalities_all),1))],
                                  "Injuries":[str(round(np.average(injuries_all),1))+"+-"+str(round(np.std(injuries_all),1))],
                                  "Structures Damaged":[str(round(np.average(str_dam_all),1))+"+-"+str(round(np.std(str_dam_all),1))],
                                  "Structures Destroyed":[str(round(np.average(str_des_all),1))+"+-"+str(round(np.std(str_des_all),1))],
                                  "Severity":[str(round(np.average(severity_accross_all_incidents),1))+"+-"+str(round(np.std(severity_accross_all_incidents),1))]},
                                 index =['Total Reports'])
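
Each cell of the row above repeats the same mean and standard deviation formatting. A small helper, sketched here under the hypothetical name format_mean_std, would produce strings in the same "mean+-std" layout. Note that np.std defaults to the population standard deviation (ddof=0).

```python
import numpy as np


def format_mean_std(values, decimals=1):
    """Format a sequence as 'mean+-std', rounded to `decimals` places."""
    mean = round(float(np.average(values)), decimals)
    std = round(float(np.std(values)), decimals)  # population std (ddof=0)
    return f"{mean}+-{std}"


example = format_mean_std([1, 2, 3])
```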

Here we format each severity component as mean +- standard deviation under a readable column name:

[26]:

values = ['Fatalities', 'Injuries', 'Structures Damaged', 'Structures Destroyed']
for value in values:
    table[value] = severity_table['Average '+value].astype(str) + "+-" + severity_table['std dev '+value].astype(str)

Now we combine the dataframes to get the final primary results table:

[27]:

table['Severity'] = severity_table['formatted']
columns = ['Hazard Category', 'Hazard Name', 'OTTO %', 'Total Fire Frequency', 'Rate'] + values +['Severity']
table = table[columns]
table = table.set_index('Hazard Category')#.drop(['Hazard Category'], axis=1)
table = pd.concat([table, total_incidents_df.drop(['Hazard Category'], axis=1)])  # DataFrame.append was removed in pandas 2.0
display(table)

	Hazard Name	OTTO %	Total Fire Frequency	Rate	Fatalities	Injuries	Structures Damaged	Structures Destroyed	Severity
Environment	Hazardous Terrain	53.3+-36.5	2900	322.2+-120.9	0.0+-0.5	1.4+-3.9	0.8+-10.1	4.5+-41.1	6.8+-46.4
Environment	Ecological Resources	46.8+-35.1	792	88.0+-29.2	0.0+-0.4	2.5+-5.7	1.5+-18.4	6.4+-34.3	10.5+-50.0
Environment	Thunderstorms	55.1+-35.8	1127	125.2+-51.2	0.0+-0.5	1.8+-4.8	1.1+-15.6	5.0+-33.6	8.0+-45.8
Environment	Wind	51.8+-36.3	2950	327.8+-120.2	0.0+-0.5	1.3+-3.9	1.1+-12.1	6.2+-53.2	8.6+-60.1
Environment	Dry Weather	55.4+-36.3	2171	241.2+-92.6	0.0+-0.5	1.5+-4.2	1.4+-13.8	7.6+-61.6	10.6+-69.5
Environment	Rain	64.4+-37.7	1696	188.4+-42.7	0.0+-0.5	1.2+-3.4	1.2+-14.3	5.0+-49.5	7.4+-56.5
Environment	Smoke	48.6+-37.5	1281	142.3+-33.4	0.0+-0.5	1.8+-5.0	1.4+-15.3	7.6+-58.9	10.8+-69.1
Mission	Evacuations	36.0+-31.0	1296	144.0+-58.4	0.1+-0.7	2.5+-5.5	2.5+-18.0	14.6+-80.1	19.6+-90.0
Mission	Injury	56.0+-34.3	783	87.0+-31.7	0.1+-0.5	3.9+-6.2	1.9+-18.5	14.1+-95.0	20.0+-104.9
Mission	Resource Shortage	40.2+-33.7	1229	136.6+-58.3	0.0+-0.6	2.4+-5.4	1.5+-15.5	8.4+-60.0	12.4+-70.8
Mission	Road Closures	44.5+-34.6	1726	191.8+-67.6	0.0+-0.4	2.0+-4.8	1.9+-16.0	9.9+-69.5	13.9+-78.6
Mission	Command Transition	60.8+-38.2	1868	207.6+-55.0	0.0+-0.6	2.2+-4.9	1.7+-15.4	9.1+-66.7	13.1+-75.6
Mission	Inaccurate Mapping	65.8+-34.3	1383	153.7+-47.5	0.0+-0.2	1.8+-4.6	1.0+-9.6	5.7+-32.4	8.5+-39.0
Mission	Aerial Grounding	32.1+-26.7	149	16.6+-7.6	0.1+-0.8	6.3+-9.7	5.3+-40.2	10.4+-45.3	22.1+-86.3
Wildland Urban Interface	Military Base	58.0+-35.7	83	9.2+-3.5	0.1+-0.5	2.9+-6.8	1.1+-3.9	17.6+-74.9	21.7+-82.8
Wildland Urban Interface	Cultural Resources	44.9+-35.0	865	96.1+-29.7	0.0+-0.4	2.6+-5.4	1.5+-10.1	10.5+-70.9	14.6+-78.5
Wildland Urban Interface	Law Violations	78.2+-31.3	328	36.4+-21.4	0.1+-0.6	1.0+-3.1	1.5+-11.1	7.9+-43.4	10.4+-51.0
Wildland Urban Interface	Infrastructure	50.2+-35.0	877	97.4+-35.4	0.1+-0.8	2.4+-6.0	2.6+-20.2	15.5+-93.9	20.6+-104.7
Wildland Urban Interface	Livestock	40.2+-34.9	530	58.9+-31.2	0.1+-0.5	2.6+-6.3	0.9+-5.0	11.8+-82.1	15.4+-88.6
Total Reports			8991	999.0+-312.3	0.0+-0.3	0.6+-2.5	0.6+-8.3	2.6+-31.4	3.8+-36.3

We can also create a table with just the severity metrics:

[29]:

avg_injuries = round(np.average(injuries_all))
avg_fatalities = round(np.average(fatalities_all))
avg_des = round(np.average(str_des_all))
avg_dam = round(np.average(str_dam_all))
avg_df = pd.DataFrame({"Total Avg Injuries":[avg_injuries for hazard in hazards],
                     "Total Avg Fatalities":[avg_fatalities for hazard in hazards],
                     "Total Avg Str Dam":[avg_dam for hazard in hazards],
                     "Total Avg Str Des":[avg_des for hazard in hazards]})
severity_results = pd.DataFrame({""})

[30]:

ICS_results = pd.concat([table.drop(['Total Reports']).reset_index(drop=True), severity_table, avg_df], axis=1)
#ICS_results.to_csv(os.path.join(os.path.dirname(os.getcwd()),'results','ICS_hazards.csv'))

Risk Matrix of Hazards (rate by severity)

The next part of the primary analysis is to create a risk matrix placing hazards in risk categories according to severity and rate. Risk matrices can be created according to either FAA or USFS specifications.

For display purposes, initialize matplotlib with the following parameters:

[31]:

matplotlib.style.use("default")
plt.rcParams["font.family"] = "Times New Roman"

[32]:

#ICS_results = pd.read_csv(os.path.join(os.path.dirname(os.getcwd()),'results','ICS_hazards.csv'))

Set the index of the results dataframe to the hazards:

[33]:

ICS_results.index = ICS_results['Hazard Name']

Calculate the severity category from the results:

[34]:

severities = get_ICS_severity_FAA(ICS_results, hazards)

Calculate the likelihood category from the rates:

[35]:

rates = {hazard:float(table[table['Hazard Name']==hazard]['Rate'].values[0].split("+-")[0]) for hazard in hazards}
rates_FAA = get_likelihood_ICS_FAA(rates)
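
The sketch below shows the two steps in miniature: parsing the mean out of a "mean+-std" string, then binning the rate into an ordered likelihood category. The thresholds and category labels are placeholders chosen for illustration; they are not the FAA or USFS definitions used by get_likelihood_ICS_FAA or get_likelihood_ICS_USFS.

```python
from bisect import bisect_right


def parse_mean(formatted):
    """Extract the mean from a 'mean+-std' string such as '327.8+-120.2'."""
    return float(formatted.split("+-")[0])


# Hypothetical cut points (incidents per year) and labels, NOT the MIKA values.
THRESHOLDS = [10, 50, 100, 200]
LABELS = ["Level 1", "Level 2", "Level 3", "Level 4", "Level 5"]


def likelihood_category(rate):
    """Map a rate to the label of the interval it falls in."""
    return LABELS[bisect_right(THRESHOLDS, rate)]


# Rate strings taken from the primary results table above.
toy_rates = {"Wind": "327.8+-120.2", "Military Base": "9.2+-3.5"}
toy_likelihoods = {hazard: likelihood_category(parse_mean(text))
                   for hazard, text in toy_rates.items()}
```

Keeping the cut points in one sorted list makes it simple to swap in a different specification.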

Now you can plot a risk matrix:

[36]:

plot_risk_matrix(rates_FAA, severities, figsize=(9,8), results_path=os.path.join('risk_matrix'), save=False, max_chars=22, fontsize=10)

To produce the USFS risk matrix, follow the same method and calculate the likelihood and severity categories:

[37]:

likelihoods = get_likelihood_ICS_USFS(rates)
severities = get_ICS_severity_USFS(ICS_results, hazards)

[38]:

severities

[38]:

{'Evacuations': 'Critical',
 'Hazardous Terrain': 'Marginal',
 'Inaccurate Mapping': 'Marginal',
 'Ecological Resources': 'Marginal',
 'Command Transition': 'Marginal',
 'Wind': 'Marginal',
 'Dry Weather': 'Marginal',
 'Rain': 'Marginal',
 'Law Violations': 'Marginal',
 'Road Closures': 'Marginal',
 'Smoke': 'Marginal',
 'Military Base': 'Critical',
 'Cultural Resources': 'Marginal',
 'Resource Shortage': 'Marginal',
 'Thunderstorms': 'Marginal',
 'Infrastructure': 'Critical',
 'Injury': 'Critical',
 'Livestock': 'Marginal',
 'Aerial Grounding': 'Critical'}

Now plot the risk matrix:

[39]:

plot_USFS_risk_matrix(likelihoods, severities, figsize=(9,8), results_path=os.path.join('risk_matrix'), save=False, max_chars=24,fontsize=12)#fontsize)

Graphic Analysis:

In the graphic analysis, we produce time series graphs of hazard metrics and predictors over time.

- Time Series
  - Hazard Metrics: OTTO, Severity, Frequency
  - Predictors

Hazard Metrics Time Series

Here we graph time series for relevant hazard metrics. The metrics of interest in this example are:

- frequency
- Operational Time To Occurrence (OTTO) in percent containment
- severity

First we set the graphing parameters for consistency. For the time series we prefer the seaborn style. Then we reorder the data to match the order of the sorted hazards.

[40]:

matplotlib.style.use("seaborn-v0_8")  # same style name as in the imports cell
plt.rcParams["font.family"] = "Times New Roman"
categories = table.index[:-1]#['Hazard Category'][:-1].index
metric_data = [time_of_occurence_days, time_of_occurence_pct_contained, frequency, frequency_fires]
time_of_occurence_days = {hazard: time_of_occurence_days[hazard] for hazard in hazards_sorted}
time_of_occurence_pct_contained = {hazard: time_of_occurence_pct_contained[hazard] for hazard in hazards_sorted}
frequency = {hazard: frequency[hazard] for hazard in hazards_sorted}
frequency_fires = {hazard: frequency_fires[hazard] for hazard in hazards_sorted}

[41]:

categories

[41]:

Index(['Environment', 'Environment', 'Environment', 'Environment',
       'Environment', 'Environment', 'Environment', 'Mission', 'Mission',
       'Mission', 'Mission', 'Mission', 'Mission', 'Mission',
       'Wildland Urban Interface', 'Wildland Urban Interface',
       'Wildland Urban Interface', 'Wildland Urban Interface',
       'Wildland Urban Interface'],
      dtype='object')

Now we can input the metrics into the graph_ICS_time_series utility function, which uses graphing methods from the trend analysis module. This results in four graphs: OTTO in days, OTTO in percent containment, total frequency (i.e., number of reports with hazards, of which a fire can have several), and fire frequency (i.e., number of fires with hazards).

[42]:

graph_ICS_time_series(time_of_occurence_days, time_of_occurence_pct_contained, frequency, frequency_fires, hazards, categories, save=False, std_dev=False, results_path=os.getcwd(), figsize=figsize, fontsize=fontsize, titles=False)

To prepare for the secondary analysis, we min-max scale the frequency:

[43]:

frequencies_fire = {hazard: [frequency_fires[hazard][year] for year in frequency_fires[hazard]] for hazard in frequency_fires}
fire_freqs_scaled = {hazard: minmax_scale(frequencies_fire[hazard]) for hazard in frequencies_fire}
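
Min-max scaling maps each series onto [0, 1] via (x - min) / (max - min), which lets hazards with very different counts share one axis. The function below is a standalone sketch under the hypothetical name minmax_scale_sketch, not the minmax_scale used in the notebook; returning zeros for a constant series is a design choice made here.

```python
def minmax_scale_sketch(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = list(values)  # accepts lists, dict views, and other iterables
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series has no range; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = minmax_scale_sketch([10, 20, 40])
```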

Predictor Time Series

We are also interested in how potential predictors vary across time. The predictors of interest here are:

- fire characteristics
- operations
- intensity

Each of the predictors above is calculated by combining multiple lower-level predictors. Hence, we first define the combine_predictors function and graph each of the subpredictors.

[44]:

def combine_predictors(predictors=[], scale=True):
    max_weight = 1/len(predictors)
    num_values = len(predictors[-1])
    if scale:
        variable_weights = [minmax_scale(p) for p in predictors]
    else:
        variable_weights = predictors
    combined_vars = [[max_weight*var_weight for var_weight in var_weight_list] for var_weight_list in variable_weights]
    combined_vars = [sum([combined_vars[var][i] for var in range(len(combined_vars))]) for i in range(num_values)]
    return combined_vars
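
With scaling enabled, the function amounts to averaging the min-max scaled predictors year by year, so each of the n subpredictors carries the same weight 1/n. A NumPy sketch of that reading follows; the function name and the numbers are invented for illustration.

```python
import numpy as np


def combine_equal_weight(predictors):
    """Min-max scale each predictor, then average them position by position."""
    data = np.array([list(p) for p in predictors], dtype=float)  # shape: (n_predictors, n_years)
    lo = data.min(axis=1, keepdims=True)
    hi = data.max(axis=1, keepdims=True)
    scaled = (data - lo) / (hi - lo)  # assumes no predictor is constant
    return scaled.mean(axis=0).tolist()


# Two hypothetical predictors observed over three years.
combined = combine_equal_weight([[0, 5, 10], [100, 300, 200]])
```

Equal weighting keeps the combined value in [0, 1] and prevents a predictor with large raw units, such as acres, from dominating.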

[45]:

combined_predictors = pd.DataFrame()

Fire Characteristics

We start with the fire characteristics combined predictor. It is formed from the fire frequency, acres burned, and the number of days a fire burns on average per year. Possible future additions are the fire spread rate (WF_MAX_FSR), the number of complexes (COMPLEX), and reported evacuations (EVACUATION_REPORTED).

We define the columns of interest, then filter the incident summary reports by these columns:

[46]:

fire_trends_cols = ["FINAL_ACRES", "FOD_DISCOVERY_DOY", "FOD_CONTAIN_DOY", "START_YEAR"]
fire_trends_df = incident_summary_df[fire_trends_cols]

We get the fire frequency per year by counting the number of reports per year:

[47]:

counts = fire_trends_df["START_YEAR"].value_counts()
count = {int(year):counts[year] for year in counts.index.sort_values()}
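
Note that value_counts returns counts sorted by frequency, not by year, which is why the dictionary above is rebuilt from the sorted index. A toy example of the difference, with invented years:

```python
import pandas as pd

start_years = pd.Series([2007, 2006, 2007, 2008, 2007, 2008])

by_frequency = start_years.value_counts()          # most frequent year first
by_year = start_years.value_counts().sort_index()  # chronological order

# Plain-int dictionary keyed by year, in chronological order.
count_by_year = {int(year): int(n) for year, n in by_year.items()}
```

Any series that is later paired with per-year values should use the chronological form so the positions line up.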

[48]:

years = count.keys()

Next we calculate the total and average days burning and acreage per year:

[49]:

average_days_burning = {}
total_days_burning = {}
total_acres = {}
average_acres = {}
for year in years:
    temp_df = fire_trends_df.loc[fire_trends_df['START_YEAR']==year]
    valid_df = temp_df.dropna(subset=['FOD_DISCOVERY_DOY', "FOD_CONTAIN_DOY"])  # keep only fires with both dates
    list_of_days_burning = (valid_df["FOD_CONTAIN_DOY"] - valid_df['FOD_DISCOVERY_DOY']).tolist()
    average_days_burning[year] = np.average(list_of_days_burning)
    total_days_burning[year] = np.sum(list_of_days_burning)
    list_of_acres = temp_df['FINAL_ACRES'].dropna().tolist()
    average_acres[year] = np.average(list_of_acres)
    total_acres[year] = np.sum(list_of_acres)
#print(total_days_burning)
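
The per-year loop can also be written with groupby. The sketch below uses an invented four-row frame and drops rows with a missing discovery or containment day before taking the difference; it illustrates the pattern and is not a drop-in replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical fires; the last one has no discovery day and is excluded.
toy_fires = pd.DataFrame({
    "START_YEAR": [2006, 2006, 2007, 2007],
    "FOD_DISCOVERY_DOY": [100, 150, 200, np.nan],
    "FOD_CONTAIN_DOY": [110, 155, 230, 250],
})

valid = toy_fires.dropna(subset=["FOD_DISCOVERY_DOY", "FOD_CONTAIN_DOY"])
days_burning = valid["FOD_CONTAIN_DOY"] - valid["FOD_DISCOVERY_DOY"]

# Total and average days burning for each start year.
per_year = days_burning.groupby(valid["START_YEAR"]).agg(["sum", "mean"])
total_days = per_year["sum"].to_dict()
average_days = per_year["mean"].to_dict()
```

Dropping incomplete rows first matters because a single missing value would otherwise turn the yearly average into NaN.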

Now we can calculate the combined predictor variable, fire characteristics, using the function defined above.

[50]:

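fire_predictors = [total_acres.values(), count.values(), total_days_burning.values()]  # count is ordered by year; counts is ordered by frequency

[51]:

fire_predictors = [total_acres.values(), count.values(), total_days_burning.values()]  # count is ordered by year; counts is ordered by frequency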
combined_predictors['Fire Characteristics'] = combine_predictors(fire_predictors)
combined_predictors.index = years

Graphs

We graph the average and total values for the subpredictors comprising fire characteristics.

First we scale the predictors using minmax scaling:

[52]:

av_acres = average_acres.values()
av_days_burn = average_days_burning.values()
count = count.values()
freq_scaled = minmax_scale(count)
av_days_burn_scaled = minmax_scale(av_days_burn)
av_acres_scaled = minmax_scale(av_acres)

total_days_burn = total_days_burning.values()
total_acre = total_acres.values()
total_days_burn_scaled = minmax_scale(total_days_burn)
total_acres_scaled = minmax_scale(total_acre)

Now we graph the predictors:

[53]:

nrows = 1
ncols = 2
fig, axs = plt.subplots(nrows = nrows,
                            ncols = ncols,
                            figsize = (10,4))
fire_labels = ['Acres', 'Frequency', 'Days Burning']
fire_totals = [total_acres_scaled, freq_scaled, total_days_burn_scaled]
fire_avgs = [av_acres_scaled, freq_scaled, av_days_burn_scaled]
fig, axs[0] = plot_predictors(fire_totals, fire_labels, time=years, time_label='Year', title="Change in Fire Characteristics from 2006-2014",
                totals=True, averages=False, scaled=True, figsize=(12, 5), axs=axs[0], fig=fig, show=False, legend=False)
fig, axs[1] = plot_predictors(fire_avgs, fire_labels, time=years, time_label='Year', title="Change in Fire Characteristics from 2006-2014",
                totals=False, averages=True, scaled=True, figsize=(12, 5), axs=axs[1], fig=fig, show=False)
plt.show()

Operations

Next we examine the operational trends predictor, which is defined by aerial assets (total and max in one day), personnel (total and max in one day), and projected cost. The number of situation reports (INC_MGMT_NUM_SITREPS) is another possible addition.

First we identify the columns of interest and filter the incident summary dataframe.

[54]:

operational_trends_cols = ["TOTAL_AERIAL_SUM", "TOTAL_PERSONNEL_SUM", "WF_PEAK_AERIAL", "WF_PEAK_PERSONNEL", "START_YEAR","PROJECTED_FINAL_IM_COST"]
operational_trends_df = incident_summary_df[operational_trends_cols]

Next we calculate the average and total values for the sub predictors:

[55]:

total_aerial = {}
average_aerial = {}
total_person = {}
average_person = {}
total_cost = {}
average_cost = {}
for year in years:
    list_of_person = []
    list_of_aerial = []
    temp_df = operational_trends_df.loc[operational_trends_df['START_YEAR']==year]
    list_of_person = temp_df['WF_PEAK_PERSONNEL'].fillna(value=0).tolist()
    list_of_aerial = temp_df["WF_PEAK_AERIAL"].fillna(value=0).tolist()
    list_of_cost = temp_df["PROJECTED_FINAL_IM_COST"].dropna().tolist()
    average_aerial[year] = np.average(list_of_aerial)
    total_aerial[year] = np.sum(list_of_aerial)
    average_person[year] = np.average(list_of_person)
    total_person[year] = np.sum(list_of_person)
    average_cost[year] = np.average(list_of_cost)
    total_cost[year] = np.sum(list_of_cost)

Now we calculate the combined operations predictor variable:

[56]:

ops_predictors = [total_cost.values(), total_aerial.values(), total_person.values()]
combined_predictors['Operations'] = combine_predictors(ops_predictors)

Now we grab the values and minmax scale them for graphing:

[57]:

av_aerial = average_aerial.values()
total_aerial = total_aerial.values()

av_person = average_person.values()
total_person = total_person.values()
av_cost = average_cost.values()
total_cost = total_cost.values()

[58]:

av_cost_scaled = minmax_scale(av_cost)
av_person_scaled = minmax_scale(av_person)
av_aerial_scaled = minmax_scale(av_aerial)

total_cost_scaled = minmax_scale(total_cost)
total_person_scaled = minmax_scale(total_person)
total_aerial_scaled = minmax_scale(total_aerial)

Graphs

We then graph each of the subpredictors together, both the averages and total values.

[59]:

nrows = 1
ncols = 2
fig, axs = plt.subplots(nrows = nrows,
                            ncols = ncols,
                            figsize = (10,4))
operations_labels = ['Cost', 'Personnel', 'Aerial Assets']
operations_totals = [total_cost_scaled, total_person_scaled, total_aerial_scaled]
operations_avgs = [av_cost_scaled, av_person_scaled, av_aerial_scaled]
fig, axs[0] = plot_predictors(operations_totals, operations_labels, time=years, time_label='Year', title="Change in Operations from 2006-2014",
                totals=True, averages=False, scaled=True, figsize=(12, 5), axs=axs[0], fig=fig, show=False, legend=False)
fig, axs[1] = plot_predictors(operations_avgs, operations_labels, time=years, time_label='Year', title="Change in Operations from 2006-2014",
                totals=False, averages=True, scaled=True, figsize=(12, 5), axs=axs[1], fig=fig, show=False)
plt.show()

Intensity

The final predictor we consider is the intensity predictor, which is defined by the number of injuries, number of fatalities, number of structures damaged, number of structures destroyed.

Again, first we identify the relevant columns and filter the incident dataframe to include them:

[60]:

intensity_cols = ["STR_DESTROYED_TOTAL","STR_DAMAGED_TOTAL","INJURIES_TOTAL","FATALITIES", "START_YEAR"]
intensity_df = incident_summary_df[intensity_cols]

Now we calculate the total and average values for each of the subpredictors:

[61]:

total_str_des = {}
average_str_des = {}
total_str_damage = {}
average_str_damage = {}
total_injuries = {}
average_injuries = {}
total_fatalities = {}
average_fatalities = {}

for year in years:
    temp_df =intensity_df.loc[intensity_df['START_YEAR']==year]
    list_of_dest = temp_df["STR_DESTROYED_TOTAL"].tolist()
    list_of_dam = temp_df["STR_DAMAGED_TOTAL"].tolist()
    list_of_injury = temp_df["INJURIES_TOTAL"].tolist()
    list_of_fatalities = temp_df["FATALITIES"].tolist()
    total_str_des[year] = np.sum(list_of_dest)
    average_str_des[year] = np.average(list_of_dest)
    total_str_damage[year] = np.sum(list_of_dam)
    average_str_damage[year] = np.average(list_of_dam)
    total_injuries[year] = np.sum(list_of_injury)
    average_injuries[year] = np.average(list_of_injury)
    total_fatalities[year] = np.sum(list_of_fatalities)
    average_fatalities[year] = np.average(list_of_fatalities)

The combined intensity predictor is then calculated:

[62]:

intensity_predictors = [total_fatalities.values(), total_str_damage.values(), total_injuries.values(), total_str_des.values()]
combined_predictors['Intensity'] = combine_predictors(intensity_predictors)

Once again, we grab the values and minmax scale them for graphing:

[63]:

av_des = average_str_des.values()
total_des = total_str_des.values()
av_damage = average_str_damage.values()
total_damage = total_str_damage.values()
av_injury = average_injuries.values()
total_injury = total_injuries.values()
av_fatality = average_fatalities.values()
total_fatality = total_fatalities.values()

[64]:

total_fatality_scaled = minmax_scale(total_fatality)
total_injury_scaled = minmax_scale(total_injury)
total_damage_scaled = minmax_scale(total_damage)
total_des_scaled = minmax_scale(total_des)

av_fatality_scaled = minmax_scale(av_fatality)
av_injury_scaled = minmax_scale(av_injury)
av_damage_scaled = minmax_scale(av_damage)
av_des_scaled = minmax_scale(av_des)

Graphs

The subpredictors of intensity are graphed together as a time series:

[65]:

nrows = 1
ncols = 2
fig, axs = plt.subplots(nrows = nrows,
                            ncols = ncols,
                            figsize = (10,4))
intensity_labels = ['Fatalities', 'Injuries', 'Damaged Structures', 'Destroyed Structures']
intensity_totals = [total_fatality_scaled, total_injury_scaled, total_damage_scaled, total_des_scaled]
intensity_avgs = [av_fatality_scaled, av_injury_scaled, av_damage_scaled, av_des_scaled]
fig, axs[0] = plot_predictors(intensity_totals, intensity_labels, time=years, time_label='Year', title="Change in Intensity from 2006-2014",
                totals=True, averages=False, scaled=True, figsize=(12, 5), axs=axs[0], fig=fig, show=False, legend=False)
fig, axs[1] = plot_predictors(intensity_avgs, intensity_labels, time=years, time_label='Year', title="Change in Intensity from 2006-2014",
                totals=False, averages=True, scaled=True, figsize=(12, 5), axs=axs[1], fig=fig, show=False)
plt.show()

Combined predictors time series

To see how the predictors compare with one another, we plot all three combined predictors and their subpredictors on a single graph. We graph both the total and average values.

First we define dictionaries to contain the data and minmax scale the total data values:

[66]:

totals = {"Fire Frequency": count,
    "total Days Fires Burned": total_days_burn,
    "total Acres Fires Burned": total_acre,
    "total Aerial Assets": total_aerial,
    "total Personnel": total_person,
    "total Cost": total_cost,
    "total Structures Damaged": total_damage,
    "total Structures Destroyed": total_des,
    "total Injuries": total_injury,
    "total Fatalities": total_fatality}
totals_df = pd.DataFrame(totals)

averages = {
    "fire frequency": count,
    "average days fire burns": av_days_burn,
    "average acres fire burns": av_acres,
    "average aerial assets per fire": av_aerial,
    "average personnel per fire": av_person,
    "average cost per fire": av_cost,
    "average structures damaged per fire": av_damage,
    "average structures destroyed per fire": av_des,
    "average injuries per fire": av_injury,
    "average fatalities per fire": av_fatality}
avs_df = pd.DataFrame(averages)

totals_scaled = {feature:minmax_scale(totals[feature]) for feature in totals}

Next we scale the combined predictors as well:

[67]:

combined_predictors_scaled = combined_predictors.copy()
for col in combined_predictors_scaled:
    combined_predictors_scaled[col] = minmax_scale(combined_predictors_scaled[col])

Now we define the line style, markers, and colors for the predictor graphs:

[68]:

lines = {"Fire Frequency": '--',
    "total Days Fires Burned": '--',
    "total Acres Fires Burned": '--',
    "total Aerial Assets": '-',
    "total Personnel": '-',
    "total Cost": '-',
    "total Structures Damaged": ':',
    "total Structures Destroyed": ':',
    "total Injuries": ':',
    "total Fatalities": ':'}
colors = cm.tab10(np.linspace(0, 1, len(lines)))
# Assign one tab10 color to each feature, in the order the features appear in `lines`
colors_dict = dict(zip(lines, colors))
markers = {"Fire Frequency": '.',
    "total Days Fires Burned": 'v',
    "total Acres Fires Burned": '^',
    "total Aerial Assets": 's',
    "total Personnel": 'p',
    "total Cost": 'P',
    "total Structures Damaged": 'h',
    "total Structures Destroyed": 'X',
    "total Injuries": 'D',
    "total Fatalities": '*'}

Here we plot the total scaled value for each predictor:

[69]:

plt.figure(figsize=figsize)
plt.ylabel("Total Sum Scaled", fontsize=fontsize)
plt.xlabel("Year", fontsize=fontsize)
for feature in totals_scaled:
    plt.plot(years, totals_scaled[feature], label=feature.replace("total ",""), linestyle=lines[feature], marker=markers[feature], color=colors_dict[feature])
plt.tick_params(labelsize=fontsize)
plt.plot(years,combined_predictors_scaled['Fire Characteristics'], label='Fire Characteristics', color='black', linestyle='--')
plt.plot(years,combined_predictors_scaled['Operations'], label = 'Operations', color='black', linestyle = '-')
plt.plot(years,combined_predictors_scaled['Intensity'], label = 'Intensity', color = 'black', linestyle = ':')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=fontsize)
#plt.savefig('predictors_scaled.pdf', bbox_inches="tight")
plt.show()

Next we graph the unscaled values:

[70]:

combined_predictors_unscaled = pd.DataFrame()
fire_predictors = [total_acres.values(), counts, total_days_burning.values()]
combined_predictors_unscaled['Fire Characteristics'] = combine_predictors(fire_predictors, scale=False)
combined_predictors_unscaled.index = years
intensity_predictors = [total_fatalities.values(), total_str_damage.values(), total_injuries.values(), total_str_des.values()]
ops_predictors = [total_cost, total_aerial, total_person]
combined_predictors_unscaled['Operations'] = combine_predictors(ops_predictors, scale=False)
combined_predictors_unscaled['Intensity'] = combine_predictors(intensity_predictors, scale=False)

[71]:

plt.figure()
plt.ylabel("Total", fontsize=16)
plt.xlabel("Year", fontsize=16)
for feature in totals:
    plt.plot(years, totals[feature], label=feature.replace("total ",""), linestyle=lines[feature], marker=markers[feature], color=colors_dict[feature])
    plt.tick_params(labelsize=16)
plt.plot(years,combined_predictors_unscaled['Fire Characteristics'], label='Fire Characteristics', color='black', linestyle='--')
plt.plot(years,combined_predictors_unscaled['Operations'], label = 'Operations', color='black', linestyle = '-')
plt.plot(years,combined_predictors_unscaled['Intensity'], label = 'Intensity', color = 'black', linestyle = ':')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=16)
plt.yscale('log')
#plt.savefig('predictors.pdf', bbox_inches="tight")
plt.show()

Secondary Analysis:

The secondary analysis includes inferential statistics methods, specifically:

- Correlation matrix
- Multiple regression

Combined Predictors

We use the combined predictors, rather than the raw predictors, for the secondary analysis.

Correlation Matrix

A correlation matrix is produced to examine the pairwise relationship between hazard frequency and each predictor.

This is done using the create_correlation_matrix function in the trend analysis module. The scaled predictors and hazard frequencies are passed into the function arguments:

[72]:

corrMatrix_fires, correlation_mat_total_fires, p_values = create_correlation_matrix(combined_predictors_scaled, fire_freqs_scaled, graph=False, figsize=(9,8), fontsize=12)
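
The internals of `create_correlation_matrix` are not shown in this notebook. As a rough illustration of the underlying idea, pairwise Pearson correlation between yearly series, the sketch below uses `np.corrcoef` on made-up values. The series names and numbers are hypothetical, and p-values are left out.

```python
import numpy as np

# Hypothetical scaled yearly series (9 years each), for illustration only
series = {
    "Predictor A": [0.0, 0.2, 0.1, 0.5, 0.4, 0.9, 1.0, 0.6, 0.3],
    "Hazard X":    [0.1, 0.3, 0.1, 0.6, 0.5, 0.8, 1.0, 0.7, 0.2],  # tracks Predictor A closely
    "Hazard Y":    [0.9, 0.1, 0.5, 0.3, 1.0, 0.2, 0.4, 0.0, 0.7],
}

names = list(series)
# Rows of the stacked array are variables, columns are years
data = np.vstack([series[name] for name in names])
corr = np.corrcoef(data)  # Pearson correlation for every pair of rows

for name, row in zip(names, corr):
    print(name, np.round(row, 2))
```

The result is a square, symmetric matrix with ones on the diagonal; each off-diagonal entry is the correlation between two of the series.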

For display purposes, the correlation matrix can be reshaped to show hazards on one axis and predictors on the other:

[73]:

predictors = [p for p in combined_predictors_scaled]
hazards = [h for h in fire_freqs_scaled]
reshape_correlation_matrix(corrMatrix_fires, p_values, predictors, hazards)
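
The reshaping idea can be sketched with pandas label-based indexing: from a square correlation matrix over all columns, keep only the block that pairs hazards with predictors. This is an illustration under assumptions, not the implementation of `reshape_correlation_matrix`; the data are made up and the choice of hazards as rows is arbitrary.

```python
import pandas as pd

# Hypothetical scaled yearly values, for illustration only
df = pd.DataFrame({
    "Operations": [0.0, 0.2, 0.1, 0.5, 0.4, 0.9, 1.0, 0.6, 0.3],
    "Intensity":  [1.0, 0.7, 0.9, 0.4, 0.6, 0.2, 0.0, 0.3, 0.8],
    "Wind":       [0.1, 0.3, 0.1, 0.6, 0.5, 0.8, 1.0, 0.7, 0.2],
    "Smoke":      [0.9, 0.1, 0.5, 0.3, 1.0, 0.2, 0.4, 0.0, 0.7],
})
predictor_names = ["Operations", "Intensity"]
hazard_names = ["Wind", "Smoke"]

full = df.corr()  # square Pearson correlation matrix over all columns
# Keep hazards as rows and predictors as columns
block = full.loc[hazard_names, predictor_names]
print(block.round(2))
```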

The full correlation matrix has correlations between all hazard and predictor pairs:

[74]:

corrMatrix_fires, correlation_mat_total_fires, p_values = create_correlation_matrix(combined_predictors_scaled, fire_freqs_scaled, graph=True, figsize=(8,6), fontsize=11, save=False, results_path=os.path.join('correlation_matrix'))

Multiple regression

Multiple regression is typically used as a prediction algorithm: given a set of continuous inputs X = (x1, x2, …, xn), the regression predicts the value of a continuous variable, y. It models y as a linear combination of X, and the goodness of fit (or the error in y) indicates how well the model and its predictors predict the target.

The importance of a predictor, xi, is evaluated by shuffling its input values, and seeing how the goodness of fit/error changes.
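
The shuffling idea described above can be sketched as follows. This is a generic permutation-importance illustration on synthetic data, not the code behind `multiple_reg_feature_importance`; the helper names (`fit_linear`, `r_squared`, `permutation_importance_sketch`) and the number of repeats are assumptions made for the example.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept, via numpy's lstsq."""
    design = np.column_stack([X, np.ones(len(X))])
    solution, *_ = np.linalg.lstsq(design, y, rcond=None)
    return solution[:-1], solution[-1]

def r_squared(X, y, coef, intercept):
    """Coefficient of determination of a fitted linear model on (X, y)."""
    residual = y - (X @ coef + intercept)
    total = y - y.mean()
    return 1.0 - (residual @ residual) / (total @ total)

def permutation_importance_sketch(X, y, n_repeats=50, seed=0):
    """Mean drop in R^2 when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    coef, intercept = fit_linear(X, y)
    baseline = r_squared(X, y, coef, intercept)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            shuffled = X.copy()
            shuffled[:, j] = rng.permutation(shuffled[:, j])
            drops[j] += baseline - r_squared(shuffled, y, coef, intercept)
    return baseline, drops / n_repeats

# Synthetic data: y depends strongly on column 0 and not at all on column 1
rng = np.random.default_rng(1)
X = rng.random((40, 2))
y = 3.0 * X[:, 0] + 0.05 * rng.standard_normal(40)
baseline, importance = permutation_importance_sketch(X, y)
print(baseline, importance)
```

A large drop in R² after shuffling a column means the model relied on that predictor. With only nine yearly data points, as in this notebook, such estimates are noisy and should be read with caution.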

Our goal:

Use regression to determine which predictors are most important for the frequency of each hazard. Because we have only a limited number of data points (9), we will not be predicting on unseen data.

Inputs/Predictors:

All operations trends, fire characteristics, and intensity

Output/y:

annual frequency of hazards time series

Method:

For each hazard, use its frequency time series:

1. Fit a linear regression model to the X, y
2. Calculate the correlation coefficient

For each Xi:

3. Record the regression coefficient (beta)
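
The method above can be sketched as a loop over hazards. The data below are randomly generated stand-ins with the same shape as this analysis (9 years, 3 combined predictors); the hazard names and the helper `fit_and_score` are hypothetical, and the sketch reports R² as the goodness-of-fit measure.

```python
import numpy as np

def fit_and_score(X, y):
    """Fit y ~ X with an intercept; return (coefficients, R^2)."""
    design = np.column_stack([X, np.ones(len(X))])
    solution, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ solution
    total = y - y.mean()
    r2 = 1.0 - (residual @ residual) / (total @ total)
    return solution[:-1], r2

# Hypothetical scaled data: 9 years, 3 combined predictors, 2 hazards
rng = np.random.default_rng(0)
predictor_names = ["Fire Characteristics", "Operations", "Intensity"]
X = rng.random((9, 3))
hazard_series = {
    "Hazard A": 0.7 * X[:, 2] + 0.1 * rng.random(9),  # driven by the third predictor
    "Hazard B": rng.random(9),                        # unrelated noise
}

results = {}
for hazard, y in hazard_series.items():
    betas, r2 = fit_and_score(X, y)  # steps 1 and 2: fit the model and score it
    # Step 3: record one regression coefficient per predictor
    results[hazard] = {"R2": r2, **dict(zip(predictor_names, betas))}

for hazard, row in results.items():
    print(hazard, {key: round(float(value), 3) for key, value in row.items()})
```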

Future goal: use ML to determine whether or not a hazard will occur based on past incident reports

The output of the regression analysis is two graphs and a set of results dataframes. The first graph shows the coefficient of determination (R2) for the full model with all three predictors for each hazard. The second graph shows the importance of each predictor for each hazard. The results dataframes store the numeric information from the graphs.

[75]:

predictors = [p for p in combined_predictors_scaled]
hazards = [h.replace("total ","") for h in fire_freqs_scaled]
results_df, delta_df, coefficient_df = multiple_reg_feature_importance(predictors, hazards, correlation_mat_total_fires,
                                                       save=False, results_path=os.path.join('multiple_regression'),
                                                      r2_figsize=figsize, r2_fontsize=fontsize-4, predictor_import_figsize = (9, 2), predictor_import_fontsize=fontsize-4)
display(results_df, delta_df, coefficient_df)

	hazard	R2 for full model	MSE for full model	Fire Characteristics removed score	Operations removed score	Intensity removed score	Fire Characteristics removed MSE	Operations removed MSE	Intensity removed MSE
0	Hazardous Terrain	0.398	0.048	0.409	0.406	-0.237	0.047	0.048	0.099
1	Ecological Resources	0.377	0.089	0.039	0.123	-0.113	0.138	0.126	0.160
2	Thunderstorms	0.511	0.047	0.450	0.471	-1.205	0.053	0.051	0.212
3	Wind	0.552	0.042	0.351	-0.748	-1.659	0.061	0.165	0.251
4	Dry Weather	0.426	0.062	0.367	0.361	-0.381	0.068	0.069	0.149
5	Rain	0.639	0.036	0.145	-0.455	-1.419	0.086	0.146	0.242
6	Smoke	0.154	0.084	0.129	0.161	-0.146	0.086	0.083	0.114
7	Evacuations	0.301	0.087	0.311	0.287	-0.505	0.086	0.089	0.188
8	Injury	0.323	0.067	0.320	-0.393	0.276	0.067	0.137	0.071
9	Resource Shortage	0.347	0.072	0.173	0.212	-0.158	0.091	0.087	0.127
10	Road Closures	0.358	0.064	0.369	0.240	0.051	0.062	0.075	0.094
11	Command Transition	0.437	0.066	0.437	0.450	-0.789	0.066	0.064	0.209
12	Inaccurate Mapping	0.412	0.059	0.392	0.417	-0.271	0.061	0.059	0.128
13	Aerial Grounding	0.793	0.027	-1.452	0.303	-0.084	0.321	0.091	0.142
14	Military Base	0.600	0.048	0.014	-0.170	0.230	0.118	0.140	0.092
15	Cultural Resources	0.328	0.055	0.246	-2.172	0.216	0.061	0.258	0.064
16	Law Violations	0.320	0.069	-0.475	-0.532	0.213	0.150	0.156	0.080
17	Infrastructure	0.192	0.092	0.173	0.198	-0.079	0.094	0.091	0.122
18	Livestock	0.336	0.059	0.354	0.322	-0.529	0.057	0.060	0.135

	hazard	R2 for full model	MSE for full model	Fire Characteristics removed score	Operations removed score	Intensity removed score	Fire Characteristics removed MSE	Operations removed MSE	Intensity removed MSE
0	Hazardous Terrain	0.398	0.048	0.010	0.008	0.636	0.001	0.001	-0.051
1	Ecological Resources	0.377	0.089	0.338	0.254	0.490	-0.049	-0.037	-0.070
2	Thunderstorms	0.511	0.047	0.061	0.040	1.716	-0.006	-0.004	-0.165
3	Wind	0.552	0.042	0.201	1.301	2.211	-0.019	-0.123	-0.209
4	Dry Weather	0.426	0.062	0.059	0.065	0.807	-0.006	-0.007	-0.087
5	Rain	0.639	0.036	0.495	1.094	2.059	-0.050	-0.110	-0.206
6	Smoke	0.154	0.084	0.025	0.007	0.300	-0.002	0.001	-0.030
7	Evacuations	0.301	0.087	0.010	0.013	0.805	0.001	-0.002	-0.101
8	Injury	0.323	0.067	0.003	0.716	0.048	-0.000	-0.070	-0.005
9	Resource Shortage	0.347	0.072	0.174	0.134	0.505	-0.019	-0.015	-0.055
10	Road Closures	0.358	0.064	0.011	0.118	0.307	0.001	-0.012	-0.030
11	Command Transition	0.437	0.066	0.000	0.013	1.226	-0.000	0.002	-0.143
12	Inaccurate Mapping	0.412	0.059	0.021	0.005	0.684	-0.002	0.001	-0.069
13	Aerial Grounding	0.793	0.027	2.245	0.489	0.877	-0.294	-0.064	-0.115
14	Military Base	0.600	0.048	0.587	0.771	0.370	-0.070	-0.092	-0.044
15	Cultural Resources	0.328	0.055	0.082	2.500	0.112	-0.007	-0.203	-0.009
16	Law Violations	0.320	0.069	0.795	0.852	0.106	-0.081	-0.087	-0.011
17	Infrastructure	0.192	0.092	0.019	0.006	0.272	-0.002	0.001	-0.031
18	Livestock	0.336	0.059	0.018	0.013	0.864	0.002	-0.001	-0.076

	Hazardous Terrain	Ecological Resources	Thunderstorms	Wind	Dry Weather	Rain	Smoke	Evacuations	Injury	Resource Shortage	Road Closures	Command Transition	Inaccurate Mapping	Aerial Grounding	Military Base	Cultural Resources	Law Violations	Infrastructure	Livestock
Fire Characteristics	0.095744	-0.313514	-0.105875	0.318352	0.241583	0.410931	0.132562	0.030633	-0.008230	-0.175735	0.067870	-0.005561	0.114858	-0.931697	0.657452	-0.113661	0.645235	0.216450	0.075068
Operations	-0.022295	0.309108	-0.211922	-0.694471	-0.200577	-0.642130	-0.136309	-0.146819	0.436775	0.177629	0.154156	-0.102439	-0.013382	0.463498	0.578210	0.774194	-0.530265	-0.083715	-0.151685
Intensity	0.589151	0.659442	1.020362	1.178706	0.771584	1.156592	0.469682	0.801286	0.170047	0.603995	0.464643	0.895520	0.654770	0.843958	-0.560887	-0.258116	0.361587	0.431959	0.701494

Full predictors

As an experiment, we also perform the secondary analysis with the full set of subpredictors. This is not recommended for robust analysis due to the small sample of years in the hazard dataset. Specifically, there are more predictors than years in the dataset so the regression model will always have a perfect fit.
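
The perfect-fit effect is easy to reproduce with pure noise. In the sketch below, both the predictors and the target are random numbers with no relationship, yet with 10 predictors and only 9 observations the least-squares fit reproduces the target exactly. The sizes mirror this notebook; everything else is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_years, n_predictors = 9, 10
X = rng.random((n_years, n_predictors))  # pure noise predictors
y = rng.random(n_years)                  # pure noise target

design = np.column_stack([X, np.ones(n_years)])
# With more columns than rows, lstsq returns the minimum-norm exact solution
solution, *_ = np.linalg.lstsq(design, y, rcond=None)
residual = y - design @ solution
total = y - y.mean()
r2 = 1.0 - (residual @ residual) / (total @ total)
print(round(r2, 6))
```

An R² of 1.0 in this setting therefore says nothing about how informative the predictors are, which is why the combined predictors are preferred for the main analysis.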

[76]:

corrMatrix_fires, correlation_mat_total_fires, p_values = create_correlation_matrix(totals_scaled, fire_freqs_scaled, graph=False)

[77]:

predictors = [p for p in totals_scaled]
hazards = [h for h in fire_freqs_scaled]
reshape_correlation_matrix(corrMatrix_fires, p_values, predictors, hazards)

[78]:

predictors = [p for p in totals_scaled]
hazards = [h.replace("total ","") for h in fire_freqs_scaled]
results_df, delta_df, coefficient_df = multiple_reg_feature_importance(predictors, hazards, correlation_mat_total_fires)
display(results_df, delta_df, coefficient_df)

	hazard	R2 for full model	Fire Frequency removed score	total Days Fires Burned removed score	total Acres Fires Burned removed score	total Aerial Assets removed score	total Personnel removed score	total Cost removed score	total Structures Damaged removed score	...	Fire Frequency removed MSE	total Days Fires Burned removed MSE	total Acres Fires Burned removed MSE	total Aerial Assets removed MSE	total Personnel removed MSE	total Cost removed MSE	total Structures Damaged removed MSE	total Structures Destroyed removed MSE	total Injuries removed MSE	total Fatalities removed MSE
0	Hazardous Terrain	1.0	0.680	0.206	-0.237	0.553	0.450	0.787	0.523	...	0.026	0.064	0.099	0.036	0.044	0.017	0.038	0.018	0.015	0.022
1	Ecological Resources	1.0	-0.082	0.696	-1.720	-1.004	0.230	-0.698	0.570	...	0.155	0.044	0.391	0.288	0.110	0.244	0.062	0.014	0.010	0.050
2	Thunderstorms	1.0	0.450	0.353	0.784	1.000	0.881	0.977	0.810	...	0.053	0.062	0.021	0.000	0.011	0.002	0.018	0.055	0.012	0.005
3	Wind	1.0	0.894	0.694	-1.538	-1.166	0.810	0.891	0.974	...	0.010	0.029	0.240	0.205	0.018	0.010	0.002	0.031	0.024	0.014
4	Dry Weather	1.0	0.877	0.736	-2.479	-2.215	0.448	0.040	0.714	...	0.013	0.028	0.375	0.347	0.060	0.103	0.031	0.014	0.016	0.035
5	Rain	1.0	0.893	0.942	-0.920	-2.795	1.000	0.466	1.000	...	0.011	0.006	0.192	0.380	0.000	0.053	0.000	0.027	0.005	0.002
6	Smoke	1.0	0.991	0.516	0.105	-0.402	0.802	0.019	-0.092	...	0.001	0.048	0.089	0.139	0.020	0.097	0.108	0.033	0.001	0.035
7	Evacuations	1.0	0.501	0.802	-2.623	-2.489	0.437	-0.450	0.583	...	0.062	0.025	0.453	0.436	0.070	0.181	0.052	0.022	0.009	0.048
8	Injury	1.0	0.481	0.322	0.088	0.998	0.259	0.177	0.006	...	0.051	0.067	0.090	0.000	0.073	0.081	0.098	0.020	0.002	0.103
9	Resource Shortage	1.0	0.198	0.441	-1.338	-0.769	0.205	-0.071	0.493	...	0.088	0.061	0.257	0.194	0.087	0.118	0.056	0.011	0.015	0.031
10	Road Closures	1.0	0.885	0.444	-0.509	0.163	0.449	-0.038	0.294	...	0.011	0.055	0.149	0.083	0.055	0.103	0.070	0.018	0.006	0.052
11	Command Transition	1.0	0.352	0.980	0.491	0.970	0.950	0.630	0.745	...	0.076	0.002	0.059	0.003	0.006	0.043	0.030	0.077	0.000	0.016
12	Inaccurate Mapping	1.0	0.873	0.783	0.183	0.624	0.808	0.379	0.426	...	0.013	0.022	0.082	0.038	0.019	0.062	0.058	0.043	0.001	0.024
13	Aerial Grounding	1.0	-0.176	0.723	0.801	0.943	0.842	-0.085	0.996	...	0.154	0.036	0.026	0.007	0.021	0.142	0.001	0.013	0.004	0.003
14	Military Base	1.0	-0.645	0.989	-8.206	-3.829	-1.656	0.755	0.902	...	0.197	0.001	1.100	0.577	0.317	0.029	0.012	0.066	0.048	0.037
15	Cultural Resources	1.0	0.530	0.607	0.553	0.999	0.227	-0.268	-0.694	...	0.038	0.032	0.036	0.000	0.063	0.103	0.138	0.007	0.000	0.050
16	Law Violations	1.0	0.810	0.999	-9.595	-18.816	0.125	-0.861	0.908	...	0.019	0.000	1.078	2.017	0.089	0.189	0.009	0.017	0.032	0.009
17	Infrastructure	1.0	0.398	0.906	-4.407	-4.104	0.114	-0.643	0.306	...	0.068	0.011	0.613	0.579	0.101	0.186	0.079	0.008	0.006	0.068
18	Livestock	1.0	0.275	0.316	-0.843	0.304	0.416	0.781	0.599	...	0.064	0.060	0.163	0.062	0.052	0.019	0.035	0.023	0.016	0.034

19 rows × 23 columns

	hazard	R2 for full model	Fire Frequency removed score	total Days Fires Burned removed score	total Acres Fires Burned removed score	total Aerial Assets removed score	total Personnel removed score	total Cost removed score	total Structures Damaged removed score	...	Fire Frequency removed MSE	total Days Fires Burned removed MSE	total Acres Fires Burned removed MSE	total Aerial Assets removed MSE	total Personnel removed MSE	total Cost removed MSE	total Structures Damaged removed MSE	total Structures Destroyed removed MSE	total Injuries removed MSE	total Fatalities removed MSE
0	Hazardous Terrain	1.0	0.320	0.794	1.237	0.447	0.550	0.213	0.477	...	-0.026	-0.064	-0.099	-0.036	-0.044	-0.017	-0.038	-0.018	-0.015	-0.022
1	Ecological Resources	1.0	1.082	0.304	2.720	2.004	0.770	1.698	0.430	...	-0.155	-0.044	-0.391	-0.288	-0.110	-0.244	-0.062	-0.014	-0.010	-0.050
2	Thunderstorms	1.0	0.550	0.647	0.216	0.000	0.119	0.023	0.190	...	-0.053	-0.062	-0.021	-0.000	-0.011	-0.002	-0.018	-0.055	-0.012	-0.005
3	Wind	1.0	0.106	0.306	2.538	2.166	0.190	0.109	0.026	...	-0.010	-0.029	-0.240	-0.205	-0.018	-0.010	-0.002	-0.031	-0.024	-0.014
4	Dry Weather	1.0	0.123	0.264	3.479	3.215	0.552	0.960	0.286	...	-0.013	-0.028	-0.375	-0.347	-0.060	-0.103	-0.031	-0.014	-0.016	-0.035
5	Rain	1.0	0.107	0.058	1.920	3.795	0.000	0.534	0.000	...	-0.011	-0.006	-0.192	-0.380	-0.000	-0.053	-0.000	-0.027	-0.005	-0.002
6	Smoke	1.0	0.009	0.484	0.895	1.402	0.198	0.981	1.092	...	-0.001	-0.048	-0.089	-0.139	-0.020	-0.097	-0.108	-0.033	-0.001	-0.035
7	Evacuations	1.0	0.499	0.198	3.623	3.489	0.563	1.450	0.417	...	-0.062	-0.025	-0.453	-0.436	-0.070	-0.181	-0.052	-0.022	-0.009	-0.048
8	Injury	1.0	0.519	0.678	0.912	0.002	0.741	0.823	0.994	...	-0.051	-0.067	-0.090	-0.000	-0.073	-0.081	-0.098	-0.020	-0.002	-0.103
9	Resource Shortage	1.0	0.802	0.559	2.338	1.769	0.795	1.071	0.507	...	-0.088	-0.061	-0.257	-0.194	-0.087	-0.118	-0.056	-0.011	-0.015	-0.031
10	Road Closures	1.0	0.115	0.556	1.509	0.837	0.551	1.038	0.706	...	-0.011	-0.055	-0.149	-0.083	-0.055	-0.103	-0.070	-0.018	-0.006	-0.052
11	Command Transition	1.0	0.648	0.020	0.509	0.030	0.050	0.370	0.255	...	-0.076	-0.002	-0.059	-0.003	-0.006	-0.043	-0.030	-0.077	-0.000	-0.016
12	Inaccurate Mapping	1.0	0.127	0.217	0.817	0.376	0.192	0.621	0.574	...	-0.013	-0.022	-0.082	-0.038	-0.019	-0.062	-0.058	-0.043	-0.001	-0.024
13	Aerial Grounding	1.0	1.176	0.277	0.199	0.057	0.158	1.085	0.004	...	-0.154	-0.036	-0.026	-0.007	-0.021	-0.142	-0.001	-0.013	-0.004	-0.003
14	Military Base	1.0	1.645	0.011	9.206	4.829	2.656	0.245	0.098	...	-0.197	-0.001	-1.100	-0.577	-0.317	-0.029	-0.012	-0.066	-0.048	-0.037
15	Cultural Resources	1.0	0.470	0.393	0.447	0.001	0.773	1.268	1.694	...	-0.038	-0.032	-0.036	-0.000	-0.063	-0.103	-0.138	-0.007	-0.000	-0.050
16	Law Violations	1.0	0.190	0.001	10.595	19.816	0.875	1.861	0.092	...	-0.019	-0.000	-1.078	-2.017	-0.089	-0.189	-0.009	-0.017	-0.032	-0.009
17	Infrastructure	1.0	0.602	0.094	5.407	5.104	0.886	1.643	0.694	...	-0.068	-0.011	-0.613	-0.579	-0.101	-0.186	-0.079	-0.008	-0.006	-0.068
18	Livestock	1.0	0.725	0.684	1.843	0.696	0.584	0.219	0.401	...	-0.064	-0.060	-0.163	-0.062	-0.052	-0.019	-0.035	-0.023	-0.016	-0.034

19 rows × 23 columns

	Hazardous Terrain	Ecological Resources	Thunderstorms	Wind	Dry Weather	Rain	Smoke	Evacuations	Injury	Resource Shortage	Road Closures	Command Transition	Inaccurate Mapping	Aerial Grounding	Military Base	Cultural Resources	Law Violations	Infrastructure	Livestock
Fire Frequency	-0.313581	-0.771285	-0.449935	-0.196329	-0.225604	0.202982	0.059588	-0.488699	-0.442255	-0.581090	-0.208721	-0.538615	-0.220789	-0.768341	-0.867605	-0.382661	0.272065	-0.511384	-0.495606
Days Fires Burned	-0.385132	-0.318617	-0.380660	-0.259672	-0.257430	0.115959	-0.334563	-0.240227	-0.394124	-0.378236	-0.357760	-0.074543	-0.225562	-0.290985	0.054977	-0.272717	-0.013717	-0.157257	-0.375137
Acres Fires Burned	0.503948	0.999900	0.230524	0.783731	0.979924	0.701707	0.476732	1.077046	0.479210	0.810874	0.618201	0.390072	0.458425	0.258290	1.678019	0.304997	1.661379	1.252659	0.645825
Aerial Assets	-0.345966	-0.980415	-0.006346	-0.827021	-1.076034	-1.126873	-0.681862	-1.207323	-0.026165	-0.805801	-0.526007	-0.107781	-0.355459	-0.158250	-1.388392	-0.019445	-2.595552	-1.390232	-0.453423
Personnel	0.404772	0.640593	0.205707	0.258009	0.470289	-0.007586	0.270491	0.511258	0.520117	0.569720	0.449915	0.148036	0.267860	0.277383	1.085640	0.483215	0.575140	0.610948	0.437802
Cost	0.254719	0.962682	0.092205	0.198314	0.627064	0.450813	0.608486	0.830326	0.554710	0.668622	0.624789	0.405531	0.486784	0.735040	0.333345	0.625970	0.848553	0.841511	0.270968
Structures Damaged	-0.456810	-0.580431	-0.316060	-0.116516	-0.410260	0.000946	-0.768907	-0.533447	-0.730414	-0.551122	-0.617073	-0.402928	-0.560947	-0.055859	-0.252184	-0.867102	-0.225945	-0.655046	-0.439747
Structures Destroyed	0.428869	0.381243	0.757483	0.565964	0.377243	0.532924	0.583970	0.477964	0.457771	0.344615	0.436865	0.893835	0.668565	0.364034	-0.828011	0.272584	-0.425535	0.296102	0.492681
Injuries	0.278519	0.228356	0.249203	0.353876	0.288878	0.152031	0.058770	0.215857	0.090744	0.280603	0.171000	-0.001509	0.085335	0.148282	0.494017	-0.031798	0.406409	0.176420	0.286667
Fatalities	-0.288774	-0.437670	-0.135920	-0.230199	-0.368195	0.094284	-0.367375	-0.427765	-0.628802	-0.341738	-0.448117	-0.245164	-0.300802	-0.107585	-0.377401	-0.436124	-0.184218	-0.510116	-0.359751

[79]:

cols = [col for col in delta_df.columns if "MSE" in col]
delta_df.drop(cols, axis=1)

[79]:

	hazard	R2 for full model	Fire Frequency removed score	total Days Fires Burned removed score	total Acres Fires Burned removed score	total Aerial Assets removed score	total Personnel removed score	total Cost removed score	total Structures Damaged removed score	total Structures Destroyed removed score	total Injuries removed score	total Fatalities removed score
0	Hazardous Terrain	1.0	0.320	0.794	1.237	0.447	0.550	0.213	0.477	0.220	0.189	0.272
1	Ecological Resources	1.0	1.082	0.304	2.720	2.004	0.770	1.698	0.430	0.097	0.071	0.349
2	Thunderstorms	1.0	0.550	0.647	0.216	0.000	0.119	0.023	0.190	0.574	0.126	0.050
3	Wind	1.0	0.106	0.306	2.538	2.166	0.190	0.109	0.026	0.326	0.259	0.146
4	Dry Weather	1.0	0.123	0.264	3.479	3.215	0.552	0.960	0.286	0.127	0.151	0.329
5	Rain	1.0	0.107	0.058	1.920	3.795	0.000	0.534	0.000	0.272	0.045	0.023
6	Smoke	1.0	0.009	0.484	0.895	1.402	0.198	0.981	1.092	0.330	0.007	0.355
7	Evacuations	1.0	0.499	0.198	3.623	3.489	0.563	1.450	0.417	0.176	0.073	0.382
8	Injury	1.0	0.519	0.678	0.912	0.002	0.741	0.823	0.994	0.205	0.016	1.051
9	Resource Shortage	1.0	0.802	0.559	2.338	1.769	0.795	1.071	0.507	0.104	0.140	0.278
10	Road Closures	1.0	0.115	0.556	1.509	0.837	0.551	1.038	0.706	0.185	0.058	0.531
11	Command Transition	1.0	0.648	0.020	0.509	0.030	0.050	0.370	0.255	0.657	0.000	0.134
12	Inaccurate Mapping	1.0	0.127	0.217	0.817	0.376	0.192	0.621	0.574	0.427	0.014	0.235
13	Aerial Grounding	1.0	1.176	0.277	0.199	0.057	0.158	1.085	0.004	0.097	0.033	0.023
14	Military Base	1.0	1.645	0.011	9.206	4.829	2.656	0.245	0.098	0.551	0.399	0.311
15	Cultural Resources	1.0	0.470	0.393	0.447	0.001	0.773	1.268	1.694	0.088	0.002	0.611
16	Law Violations	1.0	0.190	0.001	10.595	19.816	0.875	1.861	0.092	0.171	0.317	0.087
17	Infrastructure	1.0	0.602	0.094	5.407	5.104	0.886	1.643	0.694	0.074	0.054	0.600
18	Livestock	1.0	0.725	0.684	1.843	0.696	0.584	0.219	0.401	0.264	0.182	0.383

Experimental

The following cells are experimental supplementary analyses related to the secondary analysis. They explore the choice of predictors and collinearity among them.

[80]:

totals = {key: list(totals[key]) for key in totals}

[81]:

totals_new = {predictor: totals_scaled[predictor] for predictor in totals_scaled if predictor not in ["total Structures Damaged", "total Structures Destroyed"]}
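# Merge the two structure predictors into one series by summing them year by year, then rescale
structures_combined = [damaged + destroyed for damaged, destroyed in zip(totals["total Structures Damaged"], totals["total Structures Destroyed"])]
totals_new["total structure"] = minmax_scale(structures_combined)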
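
One common way to check whether predictors are collinear, and so whether merging them as above is reasonable, is the variance inflation factor (VIF). The sketch below is a generic illustration on synthetic data and is not part of the MIKA toolkit; the function name `variance_inflation_factors` and the example values are assumptions.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the others."""
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(n_samples)])
        solution, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ solution
        total = target - target.mean()
        r2 = 1.0 - (residual @ residual) / (total @ total)
        vifs[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return vifs

# Synthetic predictors: columns 0 and 1 are near copies, column 2 is independent
rng = np.random.default_rng(0)
base = rng.random(30)
X = np.column_stack([base, base + 0.01 * rng.standard_normal(30), rng.random(30)])
vifs = variance_inflation_factors(X)
print(np.round(vifs, 1))
```

Large VIF values flag predictors that are almost linear combinations of the others. Note that the calculation needs more observations than predictors, so with nine years of data it only makes sense on a reduced predictor set.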

[82]:

corrMatrix_fires, correlation_mat_total_fires, p_values = create_correlation_matrix(totals_new, fire_freqs_scaled, graph=False)

[83]:

correlation_mat_total_fires

[83]:

	Fire Frequency	total Days Fires Burned	total Acres Fires Burned	total Aerial Assets	total Personnel	total Cost	total Injuries	total Fatalities	total structure	Hazardous Terrain	...	Resource Shortage	Road Closures	Command Transition	Inaccurate Mapping	Aerial Grounding	Military Base	Cultural Resources	Law Violations	Infrastructure	Livestock
0	1.000000	0.968467	0.971840	0.903676	1.000000	0.351300	0.837050	0.88	0.233468	0.290398	...	0.130682	0.302326	0.000000	0.206667	0.000000	1.0	0.432692	0.582090	0.228571	0.142857
1	0.514501	0.417518	0.855911	0.641318	0.806630	0.484181	1.000000	0.84	1.000000	1.000000	...	1.000000	1.000000	1.000000	1.000000	0.904762	0.8	0.875000	0.656716	1.000000	1.000000
2	0.543502	0.000000	0.269501	0.395437	0.604627	1.000000	0.778731	0.76	0.542703	0.711944	...	0.857955	0.888372	0.534161	0.740000	1.000000	0.4	0.894231	0.791045	0.790476	0.619048
3	0.015038	0.251387	0.316664	0.065272	0.253145	0.123329	0.195540	0.12	0.192790	0.494145	...	0.602273	0.530233	0.360248	0.413333	0.523810	1.0	0.653846	0.686567	0.800000	0.533333
4	0.258861	0.124672	0.000000	0.000000	0.055537	0.000000	0.000000	0.00	0.000000	0.210773	...	0.136364	0.255814	0.006211	0.166667	0.095238	0.2	0.490385	0.223881	0.257143	0.152381
5	0.854995	1.000000	0.930638	0.426489	0.443150	0.278302	0.346484	0.36	0.723218	0.484778	...	0.431818	0.655814	0.627329	0.693333	0.142857	0.6	0.576923	1.000000	0.866667	0.457143
6	0.461869	0.924088	1.000000	1.000000	0.896474	0.817606	0.696398	0.64	0.729902	0.697892	...	0.778409	0.837209	0.894410	0.806667	0.904762	1.0	1.000000	0.313433	0.876190	0.695238
7	0.000000	0.569343	0.264532	0.419518	0.524836	0.559381	0.423671	1.00	0.264700	0.372365	...	0.488636	0.376744	0.453416	0.393333	0.761905	0.6	0.711538	0.104478	0.447619	0.333333
8	0.111708	0.266861	0.176067	0.245300	0.000000	0.547244	0.240137	0.16	0.576248	0.000000	...	0.000000	0.000000	0.124224	0.000000	0.619048	0.0	0.000000	0.000000	0.000000	0.000000

9 rows × 28 columns

[84]:

predictors = [p for p in totals_new]
hazards = [h for h in fire_freqs_scaled]
reshape_correlation_matrix(corrMatrix_fires, p_values, predictors, hazards)

[85]:

predictors = [p for p in totals_new]
hazards = [h.replace("total ","") for h in fire_freqs_scaled]
results_df, delta_df, coefficient_df = multiple_reg_feature_importance(predictors, hazards, correlation_mat_total_fires)
display(results_df, delta_df, coefficient_df)

	hazard	R2 for full model	Fire Frequency removed score	total Days Fires Burned removed score	total Acres Fires Burned removed score	total Aerial Assets removed score	total Personnel removed score	total Cost removed score	total Injuries removed score	...	total structure removed score	Fire Frequency removed MSE	total Days Fires Burned removed MSE	total Acres Fires Burned removed MSE	total Aerial Assets removed MSE	total Personnel removed MSE	total Cost removed MSE	total Injuries removed MSE	total Fatalities removed MSE	total structure removed MSE
0	Hazardous Terrain	1.0	0.998	-3.219	0.768	-0.623	-18.478	0.942	-5.249	...	-0.482	0.000	0.338	0.019	0.130	1.562	0.005	0.501	0.003	0.119
1	Ecological Resources	1.0	0.656	-1.463	-0.201	-2.790	-15.777	0.526	-4.314	...	0.093	0.049	0.354	0.172	0.544	2.409	0.068	0.763	0.000	0.130
2	Thunderstorms	1.0	0.919	-2.368	0.992	0.703	-11.873	0.782	-4.017	...	-1.165	0.008	0.324	0.001	0.029	1.238	0.021	0.482	0.017	0.208
3	Wind	1.0	0.994	-0.259	-0.664	-2.116	-3.930	1.000	0.152	...	0.025	0.001	0.119	0.157	0.295	0.466	0.000	0.080	0.000	0.092
4	Dry Weather	1.0	0.998	-0.988	-0.903	-4.106	-11.762	0.807	-2.505	...	0.132	0.000	0.214	0.205	0.551	1.376	0.021	0.378	0.000	0.094
5	Rain	1.0	0.775	0.991	-0.465	-3.489	-0.057	0.718	0.612	...	0.369	0.023	0.001	0.147	0.450	0.106	0.028	0.039	0.013	0.063
6	Smoke	1.0	0.292	-4.610	0.999	-3.227	-31.827	1.000	-16.137	...	-1.675	0.070	0.557	0.000	0.420	3.259	0.000	1.701	0.019	0.265
7	Evacuations	1.0	0.948	-1.310	-0.745	-4.893	-16.169	0.691	-4.964	...	-0.228	0.006	0.289	0.218	0.737	2.148	0.039	0.746	0.000	0.154
8	Injury	1.0	0.999	-4.561	0.985	0.284	-31.617	1.000	-13.042	...	-1.001	0.000	0.547	0.001	0.070	3.208	0.000	1.381	0.001	0.197
9	Resource Shortage	1.0	0.834	-2.396	0.128	-2.615	-17.949	0.861	-4.717	...	-0.018	0.018	0.373	0.096	0.397	2.082	0.015	0.628	0.003	0.112
10	Road Closures	1.0	0.937	-3.285	0.741	-1.578	-23.280	0.952	-8.632	...	-0.605	0.006	0.424	0.026	0.255	2.402	0.005	0.953	0.001	0.159
11	Command Transition	1.0	0.913	-0.671	0.991	0.409	-13.779	0.993	-7.398	...	-1.566	0.010	0.195	0.001	0.069	1.727	0.001	0.982	0.014	0.300
12	Inaccurate Mapping	1.0	0.937	-2.354	0.973	-0.752	-21.516	0.999	-10.048	...	-1.394	0.006	0.337	0.003	0.176	2.262	0.000	1.110	0.013	0.241
13	Aerial Grounding	1.0	0.080	0.345	0.913	0.852	-0.647	0.239	0.721	...	0.726	0.121	0.086	0.011	0.019	0.216	0.100	0.037	0.000	0.036
14	Military Base	1.0	-0.739	0.964	-8.432	-3.656	-0.897	0.705	0.295	...	0.182	0.208	0.004	1.127	0.556	0.227	0.035	0.084	0.045	0.098
15	Cultural Resources	1.0	0.985	-4.960	0.897	0.029	-42.847	0.999	-20.483	...	-0.955	0.001	0.485	0.008	0.079	3.566	0.000	1.747	0.010	0.159
16	Law Violations	1.0	0.762	0.980	-9.159	-19.375	-0.706	-0.674	0.927	...	0.832	0.024	0.002	1.034	2.074	0.174	0.170	0.007	0.005	0.017
17	Infrastructure	1.0	0.945	-1.311	-1.770	-7.370	-21.867	0.699	-7.042	...	-0.043	0.006	0.262	0.314	0.949	2.593	0.034	0.912	0.000	0.118
18	Livestock	1.0	0.875	-2.789	0.435	-1.008	-17.469	0.957	-4.752	...	-0.545	0.011	0.335	0.050	0.178	1.633	0.004	0.509	0.001	0.137

19 rows × 21 columns

	hazard	R2 for full model	Fire Frequency removed score	total Days Fires Burned removed score	total Acres Fires Burned removed score	total Aerial Assets removed score	total Personnel removed score	total Cost removed score	total Injuries removed score	...	total structure removed score	Fire Frequency removed MSE	total Days Fires Burned removed MSE	total Acres Fires Burned removed MSE	total Aerial Assets removed MSE	total Personnel removed MSE	total Cost removed MSE	total Injuries removed MSE	total Fatalities removed MSE	total structure removed MSE
0	Hazardous Terrain	1.0	0.002	4.219	0.232	1.623	19.478	0.058	6.249	...	1.482	-0.000	-0.338	-0.019	-0.130	-1.562	-0.005	-0.501	-0.003	-0.119
1	Ecological Resources	1.0	0.344	2.463	1.201	3.790	16.777	0.474	5.314	...	0.907	-0.049	-0.354	-0.172	-0.544	-2.409	-0.068	-0.763	-0.000	-0.130
2	Thunderstorms	1.0	0.081	3.368	0.008	0.297	12.873	0.218	5.017	...	2.165	-0.008	-0.324	-0.001	-0.029	-1.238	-0.021	-0.482	-0.017	-0.208
3	Wind	1.0	0.006	1.259	1.664	3.116	4.930	0.000	0.848	...	0.975	-0.001	-0.119	-0.157	-0.295	-0.466	-0.000	-0.080	-0.000	-0.092
4	Dry Weather	1.0	0.002	1.988	1.903	5.106	12.762	0.193	3.505	...	0.868	-0.000	-0.214	-0.205	-0.551	-1.376	-0.021	-0.378	-0.000	-0.094
5	Rain	1.0	0.225	0.009	1.465	4.489	1.057	0.282	0.388	...	0.631	-0.023	-0.001	-0.147	-0.450	-0.106	-0.028	-0.039	-0.013	-0.063
6	Smoke	1.0	0.708	5.610	0.001	4.227	32.827	0.000	17.137	...	2.675	-0.070	-0.557	-0.000	-0.420	-3.259	-0.000	-1.701	-0.019	-0.265
7	Evacuations	1.0	0.052	2.310	1.745	5.893	17.169	0.309	5.964	...	1.228	-0.006	-0.289	-0.218	-0.737	-2.148	-0.039	-0.746	-0.000	-0.154
8	Injury	1.0	0.001	5.561	0.015	0.716	32.617	0.000	14.042	...	2.001	-0.000	-0.547	-0.001	-0.070	-3.208	-0.000	-1.381	-0.001	-0.197
9	Resource Shortage	1.0	0.166	3.396	0.872	3.615	18.949	0.139	5.717	...	1.018	-0.018	-0.373	-0.096	-0.397	-2.082	-0.015	-0.628	-0.003	-0.112
10	Road Closures	1.0	0.063	4.285	0.259	2.578	24.280	0.048	9.632	...	1.605	-0.006	-0.424	-0.026	-0.255	-2.402	-0.005	-0.953	-0.001	-0.159
11	Command Transition	1.0	0.087	1.671	0.009	0.591	14.779	0.007	8.398	...	2.566	-0.010	-0.195	-0.001	-0.069	-1.727	-0.001	-0.982	-0.014	-0.300
12	Inaccurate Mapping	1.0	0.063	3.354	0.027	1.752	22.516	0.001	11.048	...	2.394	-0.006	-0.337	-0.003	-0.176	-2.262	-0.000	-1.110	-0.013	-0.241
13	Aerial Grounding	1.0	0.920	0.655	0.087	0.148	1.647	0.761	0.279	...	0.274	-0.121	-0.086	-0.011	-0.019	-0.216	-0.100	-0.037	-0.000	-0.036
14	Military Base	1.0	1.739	0.036	9.432	4.656	1.897	0.295	0.705	...	0.818	-0.208	-0.004	-1.127	-0.556	-0.227	-0.035	-0.084	-0.045	-0.098
15	Cultural Resources	1.0	0.015	5.960	0.103	0.971	43.847	0.001	21.483	...	1.955	-0.001	-0.485	-0.008	-0.079	-3.566	-0.000	-1.747	-0.010	-0.159
16	Law Violations	1.0	0.238	0.020	10.159	20.375	1.706	1.674	0.073	...	0.168	-0.024	-0.002	-1.034	-2.074	-0.174	-0.170	-0.007	-0.005	-0.017
17	Infrastructure	1.0	0.055	2.311	2.770	8.370	22.867	0.301	8.042	...	1.043	-0.006	-0.262	-0.314	-0.949	-2.593	-0.034	-0.912	-0.000	-0.118
18	Livestock	1.0	0.125	3.789	0.565	2.008	18.469	0.043	5.752	...	1.545	-0.011	-0.335	-0.050	-0.178	-1.633	-0.004	-0.509	-0.001	-0.137

19 rows × 21 columns

	Hazardous Terrain	Ecological Resources	Thunderstorms	Wind	Dry Weather	Rain	Smoke	Evacuations	Injury	Resource Shortage	Road Closures	Command Transition	Inaccurate Mapping	Aerial Grounding	Military Base	Cultural Resources	Law Violations	Infrastructure	Livestock
Fire Frequency	-0.026710	-0.434907	-0.172281	-0.044706	0.030674	0.294113	0.518757	-0.157628	-0.022735	-0.264684	0.154315	-0.196994	0.155867	-0.679722	-0.892195	0.068827	0.304387	-0.154814	-0.205733
Days Fires Burned	-0.887679	-0.907474	-0.868228	-0.526449	-0.706361	-0.045041	-1.138592	-0.820146	-1.128467	-0.932083	-0.993370	-0.674310	-0.885755	-0.447021	0.100759	-1.062265	-0.068719	-0.781091	-0.883145
Acres Fires Burned	0.218366	0.664421	-0.044154	0.634500	0.724769	0.612986	0.019111	0.747378	0.060722	0.495242	0.256264	0.051919	0.083995	0.171237	1.698509	-0.146490	1.626812	0.896498	0.357554
Aerial Assets	-0.659419	-1.348238	-0.308950	-0.991925	-1.356072	-1.225549	-1.183809	-1.569110	-0.484941	-1.151817	-0.922921	-0.480188	-0.766777	-0.254556	-1.363317	-0.513679	-2.631943	-1.780378	-0.770022
Personnel	2.408827	2.991060	2.143770	1.315660	2.260642	0.627199	3.478669	2.824167	3.451628	2.780700	2.986533	2.532783	2.898649	0.895393	0.917536	3.639132	0.803142	3.103003	2.462551
Cost	-0.132613	0.508451	-0.282524	-0.006252	0.281038	0.327950	-0.011527	0.383307	-0.011801	0.241353	0.134572	-0.055549	-0.021726	0.615493	0.366185	0.016190	0.804696	0.359965	-0.120391
Injuries	-1.601518	-1.975918	-1.571002	-0.640381	-1.390647	-0.445877	-2.950252	-1.953818	-2.658327	-1.792774	-2.208000	-2.241003	-2.383291	-0.432888	0.656517	-2.989975	0.195391	-2.159981	-1.613143
Fatalities	0.111248	0.031103	0.252046	-0.017979	-0.010848	0.222284	0.272664	0.033852	-0.044206	0.099170	0.057863	0.232096	0.224665	0.016528	-0.413534	0.192506	-0.140251	-0.013458	0.044595
structure	1.055031	1.103966	1.395890	0.928990	0.936069	0.769336	1.576520	1.198997	1.357454	1.023104	1.219264	1.675715	1.500671	0.579354	-0.956481	1.220021	-0.399779	1.051994	1.131003

Collinearity

[86]:

from statsmodels.stats.outliers_influence import variance_inflation_factor

[87]:

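# Variance inflation factor (VIF) for each raw predictor in totals_new.
# No intercept column is added, so each auxiliary regression is fit through the origin.
vif_data = pd.DataFrame()
input_df = pd.DataFrame({predictor: totals_new[predictor] for predictor in totals_new})
vif_data["feature"] = input_df.columns
vif_data["VIF"] = [variance_inflation_factor(input_df.values, i)
                   for i in range(len(input_df.columns))]
vif_data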

[87]:

	feature	VIF
0	Fire Frequency	10.545422
1	total Days Fires Burned	98.508631
2	total Acres Fires Burned	318.148035
3	total Aerial Assets	47.954849
4	total Personnel	215.555330
5	total Cost	21.487662
6	total Injuries	187.177715
7	total Fatalities	80.066638
8	total structure	54.806470
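The VIF values above are of the form 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The sketch below reproduces that calculation with NumPy alone on synthetic data. It assumes, as I understand the statsmodels helper to behave, that the auxiliary regression uses exactly the columns passed in, so without an explicit constant column the R² is the uncentered one. The function name `vif` and the `add_intercept` flag are hypothetical and are not part of MIKA or statsmodels.

```python
import numpy as np

def vif(X, i, add_intercept=False):
    """Hypothetical helper: VIF of column i of X, computed as 1 / (1 - R^2).

    With add_intercept=False the regression is fit through the origin and
    R^2 is uncentered; with add_intercept=True a constant column is added
    and R^2 is the usual centered one.
    """
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    if add_intercept:
        others = np.column_stack([np.ones(len(X)), others])
        total = np.sum((y - y.mean()) ** 2)   # centered total sum of squares
    else:
        total = y @ y                         # uncentered total sum of squares
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ coef
    r2 = 1.0 - (resid @ resid) / total
    return 1.0 / (1.0 - r2)

# Synthetic, independent columns with a large common mean (illustrative data only).
rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=5.0, size=(200, 3))

vif_no_intercept = vif(X, 0, add_intercept=False)
vif_with_intercept = vif(X, 0, add_intercept=True)
print(vif_no_intercept, vif_with_intercept)
```

On this synthetic data the columns are independent, yet the no-intercept VIF is large because every column shares a large positive mean. When predictors are unscaled totals, it may therefore be worth checking whether a high VIF reflects genuine collinearity or only the missing intercept.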

[88]:

# Repeat the VIF calculation on the three combined, scaled predictors.
vif_data = pd.DataFrame()
input_df = pd.DataFrame({predictor: combined_predictors_scaled[predictor] for predictor in combined_predictors_scaled})
vif_data["feature"] = input_df.columns
vif_data["VIF"] = [variance_inflation_factor(input_df.values, i)
                   for i in range(len(input_df.columns))]
vif_data

[88]:

	feature	VIF
0	Fire Characteristics	3.984066
1	Operations	17.004590
2	Intensity	13.810379

[89]:

# Drop each combined predictor in turn and recompute the VIFs of the remaining columns,
# tracking the summed VIF so the least collinear pair can be identified.
sums = []
for col in input_df:
    temp_input = input_df.drop(col, axis=1)
    vif_data = pd.DataFrame()
    vif_data["feature"] = temp_input.columns
    vif_data["VIF"] = [variance_inflation_factor(temp_input.values, i)
                       for i in range(len(temp_input.columns))]
    sum_vif = sum(vif_data["VIF"].tolist())
    sums.append(sum_vif)
    display(col, sum_vif, vif_data)
# Report the dropped column that gives the lowest summed VIF. The label is read from
# input_df.columns (not the raw predictors list) so that it matches the columns looped over.
best = sums.index(min(sums))
print(min(sums), best, input_df.columns[best])

'Fire Characteristics'

27.612735626886067

	feature	VIF
0	Operations	13.806368
1	Intensity	13.806368

'Operations'

6.469486432284111

	feature	VIF
0	Fire Characteristics	3.234743
1	Intensity	3.234743

'Intensity'

7.96581799128896

	feature	VIF
0	Fire Characteristics	3.982909
1	Operations	3.982909

6.469486432284111 1 Operations
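In the drop-one output above, the two remaining predictors always share the same VIF. With only two columns this is expected: each is regressed on the other, so both regressions have the same R². The standard-library sketch below illustrates this, assuming the auxiliary regressions are fit without an intercept. In that case R² is the squared cosine similarity between the two columns, which equals the squared Pearson correlation if the scaled columns have zero mean. The helper names and the toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def two_column_vif(a, b):
    """VIF shared by both columns of a two-column, no-intercept design."""
    return 1.0 / (1.0 - cosine(a, b) ** 2)

def implied_cosine(vif):
    """Invert VIF = 1 / (1 - c^2) to recover |c| from a reported VIF."""
    return math.sqrt(1.0 - 1.0 / vif)

# Toy vectors (illustrative only): the VIF is the same in either order.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 1.0, 4.0, 3.0]
print(two_column_vif(a, b), two_column_vif(b, a))

# Similarity implied by the pairwise VIFs reported in the output above.
for reported in (13.806368, 3.234743, 3.982909):
    print(reported, round(implied_cosine(reported), 3))
```

Read this way, the VIF of 13.806368 for the Operations and Intensity pair corresponds to a similarity of roughly 0.96, which is consistent with the lowest summed VIF occurring when one of those two predictors is dropped.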

[90]:

# to_drop = ["total Acres Fires Burned", "total Personnel", "total Injuries", "total Aerial Assets", "total Cost"]#,"total Days Fires Burned"]
# temp_input = input_df.drop(to_drop, axis=1)
# vif_data = pd.DataFrame()
# vif_data["feature"] = temp_input.columns
# vif_data["VIF"] = [variance_inflation_factor(temp_input.values, i)
#                       for i in range(len(temp_input.columns))]
# sum_vif = sum(vif_data["VIF"].tolist())
# display(sum_vif, vif_data)