Data Example

This notebook provides an example of how to use the Data class. In this example, we use the ICS-209-PLUS data set of wildfire incident reports.

[1]:

import sys
import os
sys.path.append(os.path.join("..",".."))

from mika.utils import Data
from mika.utils.stopwords.ICS_stop_words import stop_words

Instantiate the object

[2]:

ICS_data = Data()

Load file

Define the loading information:

filename
text columns
extra non-text columns you want to keep
column with document ids

[3]:

file_name = os.path.join('..','..','data','ICS','ics209-plus-wf_sitreps_1999to2014.csv')
text_columns = ["REMARKS", "SIGNIF_EVENTS_SUMMARY", "MAJOR_PROBLEMS"]
document_id_col = "INCIDENT_ID"
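
Before loading a large CSV, it can be worth confirming that the configured column names actually appear in the file header. The following is a minimal sketch using only the standard library. `missing_columns` is a hypothetical helper and not part of MIKA, and the in-memory sample stands in for the real file.

```python
import csv
import io

def missing_columns(csv_file, required):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file))
    present = set(header)
    return [col for col in required if col not in present]

# Hypothetical stand-in for the real file, which has many more columns.
sample = io.StringIO("INCIDENT_ID,REMARKS,SIGNIF_EVENTS_SUMMARY\n1,a,b\n")
required = ["REMARKS", "SIGNIF_EVENTS_SUMMARY", "MAJOR_PROBLEMS", "INCIDENT_ID"]
print(missing_columns(sample, required))  # ['MAJOR_PROBLEMS']
```

For a file on disk, the same helper could be called inside `with open(file_name, newline="") as f:`.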

Load the data

[4]:

ICS_data.load(file_name, id_col=document_id_col, text_columns=text_columns, load_kwargs={'dtype':str})

[5]:

ICS_data.data_df

[5]:

	ACRES	ADDTNL_COOP_ASSIST_ORG_NARR	CAUSE	COMPLEX	COMPLEXITY_LEVEL_NARR	COMPLEX_NAME	CRIT_RES_NEEDS_NARR	CURRENT_THREAT_NARR	CURR_INCIDENT_AREA	CURR_INC_AREA_UOM	...	UNIT_OR_OTHER_NARR	WEATHER_CONCERNS_NARR	INCTYP_DESC	INCTYP_ABBREVIATION	REPORT_DOY	DISCOVERY_DOY	NEW_ACRES	REPORT_DAY_SPAN	WF_FSR	MAX_FIRE_PCT_FINAL_SIZE
0	1500.0		H	False							...	AIC		Wildfire	WF	165	165	1500.0	1.0	1500.0	0.39473684210526316
1	2100.0		H	False							...	AIC		Wildfire	WF	166	165	600.0	1.0	600.0	0.5526315789473685
2	3000.0		H	False							...	AIC		Wildfire	WF	167	165	900.0	1.0	900.0	0.7894736842105263
3	3800.0		H	False							...	AIC		Wildfire	WF	168	165	800.0	1.0	800.0	1.0
4	3800.0		H	False							...	AIC		Wildfire	WF	169	165	0.0	1.0	0.0	1.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
120799	150.0		H	False				contained	150.0	Acres	...			Wildfire	WF	82	80	0.0	1.0	0.0	1.0
120800	150.0		H	False				contained	150.0	Acres	...			Wildfire	WF	82	80	0.0	1.0	0.0	1.0
120801	1900.0		L	False	3			everything in canyons is natural vegetation. a...	2500.0	Acres	...		imt has spot weather forecast for next 24 hour...	Wildfire	WF	234	232	1900.0	2.0	950.0	1.0
120802	1900.0		L	False	3		none	everything in canyons is natural vegetation. a...	1900.0	Acres	...		imt has spot weather forecast for next 24 hour...	Wildfire	WF	235	232	0.0	1.0	0.0	1.0
120803	1900.0		L	False	3		none		1900.0	Acres	...			Wildfire	WF	235	232	0.0	1.0	0.0	1.0

120804 rows × 114 columns

Prepare data

This step performs simple data preparation: combining the text columns, removing incomplete rows, and creating unique IDs.

[6]:

ICS_data.prepare_data(create_ids=True, combine_columns=text_columns, remove_incomplete_rows=True)

Removing Incomplete Rows…: 100%|██████████| 120804/120804 [01:23<00:00, 1455.22it/s]
Creating Unique IDs…: 100%|██████████| 37350/37350 [00:01<00:00, 20485.25it/s]

data preparation:  1.44 minutes
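
To make the three preparation steps concrete, here is a rough sketch on a few in-memory rows. It is not MIKA's implementation. The definition of an "incomplete" row (all text columns empty), the space used to join the text, the `Unique IDs` column name, and the counter-suffix id format are all assumptions made for illustration.

```python
from collections import defaultdict

def prepare_rows(rows, text_columns, id_col):
    """Combine text columns, drop incomplete rows, and assign unique ids."""
    counts = defaultdict(int)
    prepared = []
    for row in rows:
        texts = [(row.get(col) or "").strip() for col in text_columns]
        # Assumption: a row is "incomplete" when every text column is empty.
        if not any(texts):
            continue
        new_row = dict(row)
        new_row["Combined Text"] = " ".join(t for t in texts if t)
        # Assumption: ids are made unique with a per-document counter suffix.
        new_row["Unique IDs"] = f"{row[id_col]}_{counts[row[id_col]]}"
        counts[row[id_col]] += 1
        prepared.append(new_row)
    return prepared

rows = [
    {"INCIDENT_ID": "A", "REMARKS": "road closed", "MAJOR_PROBLEMS": ""},
    {"INCIDENT_ID": "A", "REMARKS": "", "MAJOR_PROBLEMS": ""},
    {"INCIDENT_ID": "A", "REMARKS": "crews released", "MAJOR_PROBLEMS": "steep terrain"},
]
result = prepare_rows(rows, ["REMARKS", "MAJOR_PROBLEMS"], "INCIDENT_ID")
for r in result:
    print(r["Unique IDs"], "|", r["Combined Text"])
```

The second row is dropped because it has no text, which mirrors how the row count in the progress output falls from 120804 to 37350 between the two steps.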

If the columns were combined and you are only interested in the combined text, reset the text column variable:

[7]:

ICS_data.text_columns = ['Combined Text']

Preprocessing

The preprocessing step applies the more thorough cleaning required by traditional NLP methods, such as LDA and hLDA topic modeling. Before preprocessing, you should define:

any additional domain-specific stop words
any potential stop words you would like to keep
function arguments

Function arguments correspond to optional preprocessing steps, such as ngrams, spellcheck, quote mark removal, and dropping short documents.

[8]:

from mika.utils.stopwords.ICS_stop_words import stop_words

ICS_stop_words = stop_words
save_words = ['jurisdictions', 'team', 'command', 'organization', 'type', 'involved', 'transition', 'transfer', 'impact', 'concern', 'site', 'nation', 'political', 'social', 'adjacent', 'community', 'cultural', 'tribal', 'monument', 'archeaological', 'highway', 'traffic', 'road', 'travel', 'interstate', 'closure', 'remain', 'remains', 'close', 'block', 'continue', 'impact', 'access', 'limit', 'limited', 'terrain', 'rollout', 'snag', 'steep', 'debris', 'access', 'terrian', 'concern', 'hazardous', 'pose', 'heavy', 'rugged', 'difficult', 'steep', 'narrow', 'violation', 'notification', 'respond', 'law', 'patrol', 'cattle', 'buffalo', 'grow', 'allotment', 'ranch', 'sheep', 'livestock', 'grazing', 'pasture', 'threaten', 'concern', 'risk', 'threat', 'evacuation', 'evacuate', ' threaten', 'threat', 'resident', ' residence', 'level', 'notice', 'community', 'structure', 'subdivision', 'mandatory', 'order', 'effect', 'remain', 'continue', 'issued', 'issue', 'injury', 'hospital', 'injured', 'accident', 'treatment', 'laceration', 'firefighter', 'treated', 'minor', 'report', 'transport', 'heat', 'shoulder', 'ankle', 'medical', 'released', 'military', 'unexploded', 'national', 'training', 'present', 'ordinance', 'guard', 'infrastructure', 'utility', 'powerline', 'water', 'electric', 'pipeline', 'powerlines', 'watershed', 'pole', 'power', 'gas', 'concern', 'near', 'hazard', 'critical', 'threaten', 'threat', 'off', 'weather', 'behavior', 'wind', 'thunderstorm', 'storm', 'gusty', 'lightning', 'flag', 'unpredictable', 'extreme', 'erratic', 'strong', 'red', 'warning', 'species', 'specie', 'habitat', 'animal', 'plant', 'conservation', 'threaten', 'endanger', 'threat', 'sensitive', 'threatened', 'endangered', 'risk', 'loss', 'impacts', 'unstaffed', 'resources', 'support', 'crew', 'aircraft', 'helicopter', 'engines', 'staffing', 'staff', 'lack', 'need', 'shortage', 'minimal', 'share', 'necessary', 'limited', 'limit', 'fatigue', 'flood', 'flashflood', 'flash', 'risk', 'potential', 
'mapping', 'map', 'reflect', 'accurate', 'adjustment', 'change', 'reflect', 'aircraft', 'heli', 'helicopter', 'aerial', 'tanker', 'copter', 'grounded', 'ground', 'suspended', 'suspend', 'smoke', 'impact', 'hazard', 'windy', 'humidity', 'moisture', 'hot', 'drought', 'low', 'dry', 'prolonged']

[9]:

ICS_data.preprocess_data(domain_stopwords=ICS_stop_words, ngrams=False, save_words=save_words)

Preprocessing Combined Text…: 100%|██████████| 100/100 [57:47<00:00, 34.68s/it]
Removing frequent words…: 100%|██████████| 1/1 [29:07<00:00, 1747.63s/it]

Processing time:  86.94  minutes
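
The interaction between stop words and save words can be illustrated with a small sketch. This assumes that a save word is always kept, even when it also appears in a stop word list. MIKA's actual handling may differ, and `filter_tokens` and the demo lists are hypothetical.

```python
def filter_tokens(tokens, stop_words, save_words):
    """Drop stop words, but always keep tokens listed in save_words."""
    keep = set(save_words)
    drop = set(stop_words) - keep
    return [tok for tok in tokens if tok not in drop]

tokens = ["fire", "team", "report", "steep", "terrain", "today"]
demo_stop_words = ["fire", "team", "today"]     # hypothetical domain stop words
demo_save_words = ["team", "steep", "terrain"]  # words to protect from removal
print(filter_tokens(tokens, demo_stop_words, demo_save_words))
# ['team', 'report', 'steep', 'terrain']
```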

Sentence Tokenization

Sentence tokenization is available as an additional processing option. Note that it must be applied to data that has not been preprocessed.

[7]:

text_columns = ["REMARKS", "SIGNIF_EVENTS_SUMMARY", "MAJOR_PROBLEMS"]
ICS_data.load(file_name, id_col=document_id_col, text_columns=text_columns, load_kwargs={'dtype':str})
ICS_data.prepare_data(create_ids=True, combine_columns=text_columns, remove_incomplete_rows=True)
ICS_data.sentence_tokenization()

Removing Incomplete Rows…: 100%|██████████| 120804/120804 [01:22<00:00, 1469.66it/s]
Creating Unique IDs…: 100%|██████████| 37350/37350 [00:01<00:00, 20735.39it/s]

data preparation:  1.42 minutes

Sentence Tokenization…: 100%|██████████| 37350/37350 [01:32<00:00, 404.89it/s]
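
The requirement for non-preprocessed data is presumably because sentence boundaries depend on punctuation and casing that preprocessing removes. The sketch below uses a deliberately naive regular-expression splitter to show the effect. It is not the tokenizer MIKA uses, and `split_sentences` is a hypothetical helper.

```python
import re

def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

raw = "Road closed at mile 12. Crews released today."
print(split_sentences(raw))
# ['Road closed at mile 12.', 'Crews released today.']

# Once punctuation has been stripped, there is nothing left to split on.
stripped = re.sub(r"[^\w\s]", "", raw.lower())
print(split_sentences(stripped))
# ['road closed at mile 12 crews released today']
```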

Save results

After preprocessing your dataset, make sure to save it.

[11]:

ICS_data.save()

Loading preprocessed data

Previously preprocessed data saved by MIKA can be reloaded into a new Data object:

[16]:

file = os.path.join('..','..','data','ICS','ICS_filtered_preprocessed_combined_data.csv')
ICS_data = Data()
ICS_data.load(file, preprocessed=True, id_col=document_id_col, text_columns=["Combined Text"], name='ICS',  load_kwargs={'dtype':str})
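
Preprocessed text is a list of tokens, but a CSV cell can only hold a string, so token lists have to be parsed back after reading. The sketch below shows one way such a round trip can work, using only the standard library. Whether `preprocessed=True` does exactly this inside MIKA is an assumption, and the `demo-id` value is made up.

```python
import ast
import csv
import io

tokens = ["resource", "share", "cactus"]

# Write one row; the list is stored as its string representation.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["INCIDENT_ID", "Combined Text"])
writer.writerow(["demo-id", str(tokens)])

# Read it back: the cell is a plain string until it is parsed.
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["Combined Text"]).__name__)  # str

# ast.literal_eval safely turns the string back into a Python list.
restored = ast.literal_eval(row["Combined Text"])
print(restored == tokens)  # True
```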

[15]:

ICS_data.data_df

[15]:

	CY	DISCOVERY_DATE	INCIDENT_ID	PCT_CONTAINED_COMPLETED	START_YEAR	TOTAL_AERIAL	TOTAL_PERSONNEL	REPORT_DOY	DISCOVERY_DOY	Combined Text	...	EST_IM_COST_TO_DATE	STR_DAMAGED	STR_DESTROYED	NEW_ACRES	POO_STATE	POO_LATITUDE	POO_LONGITUDE	WEATHER_CONCERNS_NARR	INC_MGMT_ORG_ABBREV	EVACUATION_IN_PROGRESS
0	2010	2010-07-15 15:00:00	2000_CA-RRU-062485_VALLEY COMPLEX	80.0	2010.0	5.000000	230.000000	197	196	[resource, share, cactus]	...	10000.0	0.0	0.0	70.0	CA	33.619444	-116.9	Current Weather RH: 45 TEMP: 75 WS: 10 WD: VAR...	4	False
1	2010	2010-07-15 15:00:00	2000_CA-RRU-062485_VALLEY COMPLEX	60.0	2010.0	5.000000	230.000000	197	196	[resource, share, incident, cactus, incident, ...	...	90000.0	0.0	0.0	503.0	CA	33.666667	-117.0	Current Weather RH: 52 TEMP: 78 WS: 12 WD: Var...	4	False
2	2010	2010-07-15 15:00:00	2000_CA-RRU-062485_VALLEY COMPLEX	30.0	2010.0	4.000000	165.000000	197	196	[resource, share, cactus, erratic, wind, due, ...	...	45000.0	0.0	0.0	450.0	CA	33.65	-116.9	Current Weather RH: 52 TEMP: 78 WS: 12 WD: VAR...	4	False
3	2010	2010-07-15 15:00:00	2000_CA-RRU-062485_VALLEY COMPLEX	100.0	2010.0	4.333333	192.333333	197	196	[resource, share, cactus, cactus, become, vall...	...	10000.0	0.0	0.0	10.0	CA	33.619444	-116.9	Current Weather RH: 21 TEMP: 102 WS: 12 WD: SW...	4	False
4	2010	2010-07-15 15:00:00	2000_CA-RRU-062485_VALLEY COMPLEX	60.0	2010.0	4.333333	192.333333	197	196	[resource, share, cactus, cactus, become, vall...	...	50000.0	0.0	0.0	0.0	CA	33.666667	-117.0	Current Weather RH: 22 TEMP: 104 WS: 15 WD: SW	4	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
44358	2014	2014-03-15 14:30:00	2014_VAVAS1403037_BEAVER LODGE RD.	100.0	2014.0	0.000000	13.000000	74	74	[fast, spread, field]	...	450.0	1.0	1.0	0.0	VA	38.521389	-77.576111
44359	2014	2014-03-19 14:00:00	2014_VAVAS1406037_AIRPORT MOUNTAIN	85.0	2014.0	0.000000	18.500000	80	78	[heavy, plume, primary, carrier]	...	4000.0	0.0	0.0	85.0	VA	37.243056	-82.103333
44360	2014	2014-08-20 13:00:00	2014_WA-WFS-513_SAND RIDGE	0.0	2014.0	1.000000	95.000000	234	232	[heavy, canyon, river, mainly, canyon, come, e...	...	50000.0	0.0	0.0	1900.0	WA	46.000395	-120.024213	imt has spot weather forecast for next 24 hour...
44361	2014	2014-08-20 13:00:00	2014_WA-WFS-513_SAND RIDGE	86.0	2014.0	1.000000	120.000000	235	232	[laid, night, test, wind, remain, canyon, peri...	...	185000.0	0.0	0.0	0.0	WA	46.000395	-120.024213	imt has spot weather forecast for next 24 hour...
44362	2014	2014-08-20 13:00:00	2014_WA-WFS-513_SAND RIDGE	100.0	2014.0	0.000000	46.000000	235	232	[report, incident, wind, test, overnight]	...	200000.0	0.0	0.0	0.0	WA	46.000395	-120.024213

44363 rows × 25 columns