Data Example

This notebook shows how to use the Data class, using the ICS-209-PLUS data set of wildfire incident reports as the example.

[1]:
import sys
import os
sys.path.append(os.path.join("..",".."))

from mika.utils import Data
from mika.utils.stopwords.ICS_stop_words import stop_words

Instantiate object

[2]:
ICS_data = Data()

Load file

Define the loading information:

  • filename

  • text columns

  • extra non-text columns you want to keep

  • column with document ids

[3]:
file_name = os.path.join('..','..','data','ICS','ics209-plus-wf_sitreps_1999to2014.csv')
text_columns = ["REMARKS", "SIGNIF_EVENTS_SUMMARY", "MAJOR_PROBLEMS"]
document_id_col = "INCIDENT_ID"

Load the data

[4]:
ICS_data.load(file_name, id_col=document_id_col, text_columns=text_columns, load_kwargs={'dtype':str})
[5]:
ICS_data.data_df
[5]:
ACRES ADDTNL_COOP_ASSIST_ORG_NARR CAUSE COMPLEX COMPLEXITY_LEVEL_NARR COMPLEX_NAME CRIT_RES_NEEDS_NARR CURRENT_THREAT_NARR CURR_INCIDENT_AREA CURR_INC_AREA_UOM ... UNIT_OR_OTHER_NARR WEATHER_CONCERNS_NARR INCTYP_DESC INCTYP_ABBREVIATION REPORT_DOY DISCOVERY_DOY NEW_ACRES REPORT_DAY_SPAN WF_FSR MAX_FIRE_PCT_FINAL_SIZE
0 1500.0 H False ... AIC Wildfire WF 165 165 1500.0 1.0 1500.0 0.39473684210526316
1 2100.0 H False ... AIC Wildfire WF 166 165 600.0 1.0 600.0 0.5526315789473685
2 3000.0 H False ... AIC Wildfire WF 167 165 900.0 1.0 900.0 0.7894736842105263
3 3800.0 H False ... AIC Wildfire WF 168 165 800.0 1.0 800.0 1.0
4 3800.0 H False ... AIC Wildfire WF 169 165 0.0 1.0 0.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
120799 150.0 H False contained 150.0 Acres ... Wildfire WF 82 80 0.0 1.0 0.0 1.0
120800 150.0 H False contained 150.0 Acres ... Wildfire WF 82 80 0.0 1.0 0.0 1.0
120801 1900.0 L False 3 everything in canyons is natural vegetation. a... 2500.0 Acres ... imt has spot weather forecast for next 24 hour... Wildfire WF 234 232 1900.0 2.0 950.0 1.0
120802 1900.0 L False 3 none everything in canyons is natural vegetation. a... 1900.0 Acres ... imt has spot weather forecast for next 24 hour... Wildfire WF 235 232 0.0 1.0 0.0 1.0
120803 1900.0 L False 3 none 1900.0 Acres ... Wildfire WF 235 232 0.0 1.0 0.0 1.0

120804 rows × 114 columns
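
The load call above passes load_kwargs={'dtype':str}. Assuming these keyword arguments are forwarded to pandas.read_csv (an assumption about MIKA's internals, not something this notebook confirms), the sketch below shows the effect of that setting on a hypothetical two-row CSV: identifiers keep their original spelling instead of being parsed as numbers.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the ICS-209-PLUS file.
csv_text = "INCIDENT_ID,REMARKS\n001,fire active\n002,contained\n"

# Default type inference parses the ids as integers and drops the leading zeros.
inferred = pd.read_csv(io.StringIO(csv_text))

# dtype=str keeps every value exactly as it is written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred["INCIDENT_ID"].tolist())
print(as_text["INCIDENT_ID"].tolist())
```

Reading everything as text also avoids mixed-type warnings on wide files; numeric columns can be converted later, where needed.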

Prepare data

This step performs simple data preparation: combining the text columns, removing incomplete rows, and creating unique IDs.

[6]:
ICS_data.prepare_data(create_ids=True, combine_columns=text_columns, remove_incomplete_rows=True)
Removing Incomplete Rows…: 100%|██████████| 120804/120804 [01:23<00:00, 1455.22it/s]
Creating Unique IDs…: 100%|██████████| 37350/37350 [00:01<00:00, 20485.25it/s]
data preparation:  1.44 minutes
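
To make the three preparation steps concrete, here is a minimal pandas sketch on a hypothetical four-row table. It is not MIKA's implementation: the rule that a row is "incomplete" when every text column is empty, the ". " separator, and the "Unique IDs" column name are all assumptions chosen for illustration.

```python
import pandas as pd

text_columns = ["REMARKS", "SIGNIF_EVENTS_SUMMARY", "MAJOR_PROBLEMS"]

# Hypothetical miniature of the situation-report table.
df = pd.DataFrame({
    "INCIDENT_ID": ["A", "A", "B", "B"],
    "REMARKS": ["fire active", None, "contained", None],
    "SIGNIF_EVENTS_SUMMARY": ["wind shift", None, None, None],
    "MAJOR_PROBLEMS": [None, "steep terrain", None, None],
})

# Assumption: a row is incomplete when all of its text columns are empty.
has_text = df[text_columns].notna().any(axis=1)
df = df[has_text].reset_index(drop=True)

# Join the non-missing text fields of each row into one column.
df["Combined Text"] = df[text_columns].apply(
    lambda row: ". ".join(row.dropna()), axis=1
)

# Number the reports within each incident so every row gets its own id.
report_number = df.groupby("INCIDENT_ID").cumcount().astype(str)
df["Unique IDs"] = df["INCIDENT_ID"] + "_" + report_number

print(df[["Unique IDs", "Combined Text"]])
```

Unique IDs are needed because one INCIDENT_ID covers many situation reports, so the incident id alone does not identify a document.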


If the columns were combined and you are only interested in the combined text, reset the text column variable:

[7]:
ICS_data.text_columns = ['Combined Text']

Preprocessing

The preprocessing step applies the more thorough cleaning required by traditional NLP methods, such as LDA and hLDA topic modeling. For preprocessing, you should define:

  • any additional domain-specific stop words

  • any potential stop words you would like to keep

  • function arguments

Function arguments correspond to optional preprocessing steps, such as ngrams, spellcheck, quote mark removal, and dropping short documents.

[8]:
from mika.utils.stopwords.ICS_stop_words import stop_words

ICS_stop_words = stop_words
save_words = ['jurisdictions', 'team', 'command', 'organization', 'type', 'involved', 'transition', 'transfer', 'impact', 'concern', 'site', 'nation', 'political', 'social', 'adjacent', 'community', 'cultural', 'tribal', 'monument', 'archeaological', 'highway', 'traffic', 'road', 'travel', 'interstate', 'closure', 'remain', 'remains', 'close', 'block', 'continue', 'impact', 'access', 'limit', 'limited', 'terrain', 'rollout', 'snag', 'steep', 'debris', 'access', 'terrian', 'concern', 'hazardous', 'pose', 'heavy', 'rugged', 'difficult', 'steep', 'narrow', 'violation', 'notification', 'respond', 'law', 'patrol', 'cattle', 'buffalo', 'grow', 'allotment', 'ranch', 'sheep', 'livestock', 'grazing', 'pasture', 'threaten', 'concern', 'risk', 'threat', 'evacuation', 'evacuate', ' threaten', 'threat', 'resident', ' residence', 'level', 'notice', 'community', 'structure', 'subdivision', 'mandatory', 'order', 'effect', 'remain', 'continue', 'issued', 'issue', 'injury', 'hospital', 'injured', 'accident', 'treatment', 'laceration', 'firefighter', 'treated', 'minor', 'report', 'transport', 'heat', 'shoulder', 'ankle', 'medical', 'released', 'military', 'unexploded', 'national', 'training', 'present', 'ordinance', 'guard', 'infrastructure', 'utility', 'powerline', 'water', 'electric', 'pipeline', 'powerlines', 'watershed', 'pole', 'power', 'gas', 'concern', 'near', 'hazard', 'critical', 'threaten', 'threat', 'off', 'weather', 'behavior', 'wind', 'thunderstorm', 'storm', 'gusty', 'lightning', 'flag', 'unpredictable', 'extreme', 'erratic', 'strong', 'red', 'warning', 'species', 'specie', 'habitat', 'animal', 'plant', 'conservation', 'threaten', 'endanger', 'threat', 'sensitive', 'threatened', 'endangered', 'risk', 'loss', 'impacts', 'unstaffed', 'resources', 'support', 'crew', 'aircraft', 'helicopter', 'engines', 'staffing', 'staff', 'lack', 'need', 'shortage', 'minimal', 'share', 'necessary', 'limited', 'limit', 'fatigue', 'flood', 'flashflood', 'flash', 'risk', 'potential', 
'mapping', 'map', 'reflect', 'accurate', 'adjustment', 'change', 'reflect', 'aircraft', 'heli', 'helicopter', 'aerial', 'tanker', 'copter', 'grounded', 'ground', 'suspended', 'suspend', 'smoke', 'impact', 'hazard', 'windy', 'humidity', 'moisture', 'hot', 'drought', 'low', 'dry', 'prolonged']
[9]:
ICS_data.preprocess_data(domain_stopwords=ICS_stop_words, ngrams=False, save_words=save_words)
Preprocessing Combined Text…: 100%|██████████| 100/100 [57:47<00:00, 34.68s/it]
Removing frequent words…: 100%|██████████| 1/1 [29:07<00:00, 1747.63s/it]
Processing time:  86.94  minutes
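
The interaction between domain stop words and save words can be sketched in plain Python. This assumes that a word listed in save_words is exempt from stop word removal; the helper remove_stop_words and the short word lists are hypothetical, and MIKA's own preprocessing does more than this.

```python
import re

def remove_stop_words(text, stop_words, save_words=()):
    """Lowercase, keep alphabetic tokens, and drop stop words not in save_words."""
    blocked = set(stop_words) - set(save_words)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in blocked]

# Hypothetical lists; the real stop words come from ICS_stop_words.
demo_stop_words = ["the", "is", "on", "fire", "team"]
demo_save_words = ["team"]

tokens = remove_stop_words(
    "The team is on the fire line near the highway.",
    demo_stop_words,
    demo_save_words,
)
print(tokens)
```

Here "team" survives even though it appears in the stop word list, which is the purpose of the save list: it protects domain terms that a generic or frequency-based filter would otherwise remove.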

Sentence Tokenization

Additional processing options include sentence tokenization. Note that this must be applied to non-preprocessed data, so the cell below reloads and re-prepares the raw file before tokenizing.

[7]:
text_columns = ["REMARKS", "SIGNIF_EVENTS_SUMMARY", "MAJOR_PROBLEMS"]
ICS_data.load(file_name, id_col=document_id_col, text_columns=text_columns, load_kwargs={'dtype':str})
ICS_data.prepare_data(create_ids=True, combine_columns=text_columns, remove_incomplete_rows=True)
ICS_data.sentence_tokenization()
Removing Incomplete Rows…: 100%|██████████| 120804/120804 [01:22<00:00, 1469.66it/s]
Creating Unique IDs…: 100%|██████████| 37350/37350 [00:01<00:00, 20735.39it/s]
data preparation:  1.42 minutes

Sentence Tokenization…: 100%|██████████| 37350/37350 [01:32<00:00, 404.89it/s]
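
The reason sentence tokenization needs raw text is that sentence boundaries are marked by punctuation, which preprocessing strips. The sketch below uses a rough regular-expression splitter (a hypothetical stand-in, not the tokenizer MIKA uses) to show the difference.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace; a rough rule only."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

raw = "Fire crossed the ridge at 1400. Winds remain erratic! Evacuations ordered?"
sentences = split_sentences(raw)

# After preprocessing, punctuation is gone and the boundaries cannot be recovered.
already_cleaned = "fire cross ridge wind remain erratic evacuation order"
unsplit = split_sentences(already_cleaned)

print(sentences)
print(unsplit)
```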

Save results

After preprocessing your dataset, make sure to save it.

[11]:
ICS_data.save()

Loading preprocessed data

Previously processed MIKA data objects can be reloaded:

[16]:
file = os.path.join('..','..','data','ICS','ICS_filtered_preprocessed_combined_data.csv')
ICS_data = Data()
ICS_data.load(file, preprocessed=True, id_col=document_id_col, text_columns=["Combined Text"], name='ICS',  load_kwargs={'dtype':str})
[15]:
ICS_data.data_df
[15]:
CY DISCOVERY_DATE INCIDENT_ID PCT_CONTAINED_COMPLETED START_YEAR TOTAL_AERIAL TOTAL_PERSONNEL REPORT_DOY DISCOVERY_DOY Combined Text ... EST_IM_COST_TO_DATE STR_DAMAGED STR_DESTROYED NEW_ACRES POO_STATE POO_LATITUDE POO_LONGITUDE WEATHER_CONCERNS_NARR INC_MGMT_ORG_ABBREV EVACUATION_IN_PROGRESS
0 2010 2010-07-15 15:00:00 2000_CA-RRU-062485_VALLEY COMPLEX 80.0 2010.0 5.000000 230.000000 197 196 [resource, share, cactus] ... 10000.0 0.0 0.0 70.0 CA 33.619444 -116.9 Current Weather RH: 45 TEMP: 75 WS: 10 WD: VAR... 4 False
1 2010 2010-07-15 15:00:00 2000_CA-RRU-062485_VALLEY COMPLEX 60.0 2010.0 5.000000 230.000000 197 196 [resource, share, incident, cactus, incident, ... ... 90000.0 0.0 0.0 503.0 CA 33.666667 -117.0 Current Weather RH: 52 TEMP: 78 WS: 12 WD: Var... 4 False
2 2010 2010-07-15 15:00:00 2000_CA-RRU-062485_VALLEY COMPLEX 30.0 2010.0 4.000000 165.000000 197 196 [resource, share, cactus, erratic, wind, due, ... ... 45000.0 0.0 0.0 450.0 CA 33.65 -116.9 Current Weather RH: 52 TEMP: 78 WS: 12 WD: VAR... 4 False
3 2010 2010-07-15 15:00:00 2000_CA-RRU-062485_VALLEY COMPLEX 100.0 2010.0 4.333333 192.333333 197 196 [resource, share, cactus, cactus, become, vall... ... 10000.0 0.0 0.0 10.0 CA 33.619444 -116.9 Current Weather RH: 21 TEMP: 102 WS: 12 WD: SW... 4 False
4 2010 2010-07-15 15:00:00 2000_CA-RRU-062485_VALLEY COMPLEX 60.0 2010.0 4.333333 192.333333 197 196 [resource, share, cactus, cactus, become, vall... ... 50000.0 0.0 0.0 0.0 CA 33.666667 -117.0 Current Weather RH: 22 TEMP: 104 WS: 15 WD: SW 4 False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
44358 2014 2014-03-15 14:30:00 2014_VAVAS1403037_BEAVER LODGE RD. 100.0 2014.0 0.000000 13.000000 74 74 [fast, spread, field] ... 450.0 1.0 1.0 0.0 VA 38.521389 -77.576111
44359 2014 2014-03-19 14:00:00 2014_VAVAS1406037_AIRPORT MOUNTAIN 85.0 2014.0 0.000000 18.500000 80 78 [heavy, plume, primary, carrier] ... 4000.0 0.0 0.0 85.0 VA 37.243056 -82.103333
44360 2014 2014-08-20 13:00:00 2014_WA-WFS-513_SAND RIDGE 0.0 2014.0 1.000000 95.000000 234 232 [heavy, canyon, river, mainly, canyon, come, e... ... 50000.0 0.0 0.0 1900.0 WA 46.000395 -120.024213 imt has spot weather forecast for next 24 hour...
44361 2014 2014-08-20 13:00:00 2014_WA-WFS-513_SAND RIDGE 86.0 2014.0 1.000000 120.000000 235 232 [laid, night, test, wind, remain, canyon, peri... ... 185000.0 0.0 0.0 0.0 WA 46.000395 -120.024213 imt has spot weather forecast for next 24 hour...
44362 2014 2014-08-20 13:00:00 2014_WA-WFS-513_SAND RIDGE 100.0 2014.0 0.000000 46.000000 235 232 [report, incident, wind, test, overnight] ... 200000.0 0.0 0.0 0.0 WA 46.000395 -120.024213

44363 rows × 25 columns
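
The Combined Text column above holds token lists, but a CSV file can only store text. As a sketch of what reloading preprocessed data involves, assuming the lists were written as their Python string representation (an assumption about the file format, not a description of what preprocessed=True does internally), the round trip looks like this:

```python
import ast
import io
import pandas as pd

# Hypothetical one-row frame with a preprocessed token list.
original = pd.DataFrame({
    "INCIDENT_ID": ["A"],
    "Combined Text": [["resource", "share", "cactus"]],
})

# Write to an in-memory CSV and read it back as text.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, dtype=str)

# At this point the cell is a string, not a list.
raw_value = reloaded["Combined Text"].iloc[0]

# ast.literal_eval safely parses the string back into a Python list.
reloaded["Combined Text"] = reloaded["Combined Text"].apply(ast.literal_eval)
restored = reloaded["Combined Text"].iloc[0]

print(type(raw_value).__name__, type(restored).__name__)
```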