Data Example
This notebook shows how to use the Data class, using the ICS-209-PLUS data set of wildfire incident reports as an example.
[1]:
import sys
import os
sys.path.append(os.path.join("..",".."))
from mika.utils import Data
from mika.utils.stopwords.ICS_stop_words import stop_words
Initiate object
[2]:
ICS_data = Data()
Load file
Define the loading information:
filename
text columns
extra non-text columns you want to keep
column with document ids
[3]:
file_name = os.path.join('..','..','data','ICS','ics209-plus-wf_sitreps_1999to2014.csv')
text_columns = ["REMARKS", "SIGNIF_EVENTS_SUMMARY", "MAJOR_PROBLEMS"]
document_id_col = "INCIDENT_ID"
Load the data
[4]:
ICS_data.load(file_name, id_col=document_id_col, text_columns=text_columns, load_kwargs={'dtype':str})
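The `load_kwargs={'dtype':str}` argument looks like a pandas `read_csv` option. Assuming `load_kwargs` is forwarded to `pandas.read_csv` (this notebook does not confirm that), the minimal sketch below shows what `dtype=str` changes. The CSV content is made up for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for the ICS-209-PLUS CSV; the values are made up.
csv_text = (
    "INCIDENT_ID,REMARKS,ACRES\n"
    "001,road closure remains,1500.0\n"
    "002,,2100.0\n"
)

# Default type inference: ids are parsed as integers and lose leading zeros.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype=str every column is read as text, so ids are kept verbatim.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred["INCIDENT_ID"].tolist())
print(as_text["INCIDENT_ID"].tolist())
```

Reading everything as text keeps identifiers intact and avoids mixed-type columns, at the cost of converting numeric columns yourself later.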
[5]:
ICS_data.data_df
[5]:
ACRES | ADDTNL_COOP_ASSIST_ORG_NARR | CAUSE | COMPLEX | COMPLEXITY_LEVEL_NARR | COMPLEX_NAME | CRIT_RES_NEEDS_NARR | CURRENT_THREAT_NARR | CURR_INCIDENT_AREA | CURR_INC_AREA_UOM | ... | UNIT_OR_OTHER_NARR | WEATHER_CONCERNS_NARR | INCTYP_DESC | INCTYP_ABBREVIATION | REPORT_DOY | DISCOVERY_DOY | NEW_ACRES | REPORT_DAY_SPAN | WF_FSR | MAX_FIRE_PCT_FINAL_SIZE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1500.0 | H | False | ... | AIC | Wildfire | WF | 165 | 165 | 1500.0 | 1.0 | 1500.0 | 0.39473684210526316 | ||||||||
1 | 2100.0 | H | False | ... | AIC | Wildfire | WF | 166 | 165 | 600.0 | 1.0 | 600.0 | 0.5526315789473685 | ||||||||
2 | 3000.0 | H | False | ... | AIC | Wildfire | WF | 167 | 165 | 900.0 | 1.0 | 900.0 | 0.7894736842105263 | ||||||||
3 | 3800.0 | H | False | ... | AIC | Wildfire | WF | 168 | 165 | 800.0 | 1.0 | 800.0 | 1.0 | ||||||||
4 | 3800.0 | H | False | ... | AIC | Wildfire | WF | 169 | 165 | 0.0 | 1.0 | 0.0 | 1.0 | ||||||||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
120799 | 150.0 | H | False | contained | 150.0 | Acres | ... | Wildfire | WF | 82 | 80 | 0.0 | 1.0 | 0.0 | 1.0 | ||||||
120800 | 150.0 | H | False | contained | 150.0 | Acres | ... | Wildfire | WF | 82 | 80 | 0.0 | 1.0 | 0.0 | 1.0 | ||||||
120801 | 1900.0 | L | False | 3 | everything in canyons is natural vegetation. a... | 2500.0 | Acres | ... | imt has spot weather forecast for next 24 hour... | Wildfire | WF | 234 | 232 | 1900.0 | 2.0 | 950.0 | 1.0 | ||||
120802 | 1900.0 | L | False | 3 | none | everything in canyons is natural vegetation. a... | 1900.0 | Acres | ... | imt has spot weather forecast for next 24 hour... | Wildfire | WF | 235 | 232 | 0.0 | 1.0 | 0.0 | 1.0 | |||
120803 | 1900.0 | L | False | 3 | none | 1900.0 | Acres | ... | Wildfire | WF | 235 | 232 | 0.0 | 1.0 | 0.0 | 1.0 |
120804 rows × 114 columns
Prepare data
This step performs simple data preparation: combining the text columns, removing incomplete rows, and creating unique ids.
[6]:
ICS_data.prepare_data(create_ids=True, combine_columns=text_columns, remove_incomplete_rows=True)
Removing Incomplete Rows…: 100%|██████████| 120804/120804 [01:23<00:00, 1455.22it/s]
Creating Unique IDs…: 100%|██████████| 37350/37350 [00:01<00:00, 20485.25it/s]
data preparation: 1.44 minutes
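To make the three preparation steps concrete, here is a rough pandas sketch of what they could look like on a toy table. It is not MIKA's implementation: treating "incomplete" as "any text column is missing", the `"Unique IDs"` column name, and the id format are all assumptions made for illustration.

```python
import pandas as pd

text_columns = ["REMARKS", "SIGNIF_EVENTS_SUMMARY", "MAJOR_PROBLEMS"]

# Toy reports; incident "A" has two reports, "B" is missing text fields.
df = pd.DataFrame({
    "INCIDENT_ID": ["A", "A", "B", "C"],
    "REMARKS": ["crews holding line", "mop up", "road closed", "fire contained"],
    "SIGNIF_EVENTS_SUMMARY": ["wind shift", "none", None, "none"],
    "MAJOR_PROBLEMS": ["steep terrain", "none", None, "none"],
})

# 1. Remove incomplete rows (assumed here: any text column is missing).
df = df.dropna(subset=text_columns).reset_index(drop=True)

# 2. Combine the text columns into a single column.
df["Combined Text"] = df[text_columns].apply(lambda row: ". ".join(row), axis=1)

# 3. Create unique ids (hypothetical format: incident id plus a report counter).
counter = df.groupby("INCIDENT_ID").cumcount().astype(str)
df["Unique IDs"] = df["INCIDENT_ID"] + "_" + counter

print(df[["Unique IDs", "Combined Text"]])
```

Unique ids matter here because one incident can have many situation reports, so the incident id alone does not identify a document.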
If the columns were combined and you are only interested in the combined text, reset the text column variable:
[7]:
ICS_data.text_columns = ['Combined Text']
Preprocessing
The preprocessing step performs the more thorough preprocessing required by traditional NLP methods, such as LDA and hLDA topic modeling. For preprocessing, you should define:
any additional domain-specific stop words
any potential stop words you would like to keep
function arguments
Function arguments correspond to optional preprocessing steps, such as ngrams, spellcheck, quote mark removal, and dropping short documents.
[8]:
from mika.utils.stopwords.ICS_stop_words import stop_words
ICS_stop_words = stop_words
save_words = ['jurisdictions', 'team', 'command', 'organization', 'type', 'involved', 'transition', 'transfer', 'impact', 'concern', 'site', 'nation', 'political', 'social', 'adjacent', 'community', 'cultural', 'tribal', 'monument', 'archeaological', 'highway', 'traffic', 'road', 'travel', 'interstate', 'closure', 'remain', 'remains', 'close', 'block', 'continue', 'impact', 'access', 'limit', 'limited', 'terrain', 'rollout', 'snag', 'steep', 'debris', 'access', 'terrian', 'concern', 'hazardous', 'pose', 'heavy', 'rugged', 'difficult', 'steep', 'narrow', 'violation', 'notification', 'respond', 'law', 'patrol', 'cattle', 'buffalo', 'grow', 'allotment', 'ranch', 'sheep', 'livestock', 'grazing', 'pasture', 'threaten', 'concern', 'risk', 'threat', 'evacuation', 'evacuate', ' threaten', 'threat', 'resident', ' residence', 'level', 'notice', 'community', 'structure', 'subdivision', 'mandatory', 'order', 'effect', 'remain', 'continue', 'issued', 'issue', 'injury', 'hospital', 'injured', 'accident', 'treatment', 'laceration', 'firefighter', 'treated', 'minor', 'report', 'transport', 'heat', 'shoulder', 'ankle', 'medical', 'released', 'military', 'unexploded', 'national', 'training', 'present', 'ordinance', 'guard', 'infrastructure', 'utility', 'powerline', 'water', 'electric', 'pipeline', 'powerlines', 'watershed', 'pole', 'power', 'gas', 'concern', 'near', 'hazard', 'critical', 'threaten', 'threat', 'off', 'weather', 'behavior', 'wind', 'thunderstorm', 'storm', 'gusty', 'lightning', 'flag', 'unpredictable', 'extreme', 'erratic', 'strong', 'red', 'warning', 'species', 'specie', 'habitat', 'animal', 'plant', 'conservation', 'threaten', 'endanger', 'threat', 'sensitive', 'threatened', 'endangered', 'risk', 'loss', 'impacts', 'unstaffed', 'resources', 'support', 'crew', 'aircraft', 'helicopter', 'engines', 'staffing', 'staff', 'lack', 'need', 'shortage', 'minimal', 'share', 'necessary', 'limited', 'limit', 'fatigue', 'flood', 'flashflood', 'flash', 'risk', 'potential', 
'mapping', 'map', 'reflect', 'accurate', 'adjustment', 'change', 'reflect', 'aircraft', 'heli', 'helicopter', 'aerial', 'tanker', 'copter', 'grounded', 'ground', 'suspended', 'suspend', 'smoke', 'impact', 'hazard', 'windy', 'humidity', 'moisture', 'hot', 'drought', 'low', 'dry', 'prolonged']
[9]:
ICS_data.preprocess_data(domain_stopwords=ICS_stop_words, ngrams=False, save_words=save_words)
Preprocessing Combined Text…: 100%|██████████| 100/100 [57:47<00:00, 34.68s/it]
Removing frequent words…: 100%|██████████| 1/1 [29:07<00:00, 1747.63s/it]
Processing time: 86.94 minutes
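The interplay between domain stop words and save words can be illustrated with a small standalone sketch. This is not the `preprocess_data` implementation, which also performs other steps (such as the frequent-word removal shown in the progress output); `filter_tokens` and the toy word lists are hypothetical, and the rule that save words override the stop list is an assumption based on the description above.

```python
import re

def filter_tokens(text, stopwords, save_words):
    """Lowercase, tokenize on letters, and drop stop words unless saved."""
    keep = set(save_words)
    drop = set(stopwords) - keep  # save words override the stop list
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in drop]

# Toy lists for illustration only; not the ICS stop words.
toy_stopwords = ["the", "is", "fire", "team", "to", "due"]
toy_save_words = ["team", "wind"]

tokens = filter_tokens(
    "The team is responding to the fire due to erratic wind",
    toy_stopwords,
    toy_save_words,
)
print(tokens)  # ['team', 'responding', 'erratic', 'wind']
```

Here "team" survives even though it is on the toy stop list, which is the kind of protection a save list is meant to give domain-relevant terms.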
Sentence Tokenization
Additional processing options include sentence tokenization. Note that this must be applied to non-preprocessed data, so the cell below reloads and prepares the raw file first.
[7]:
text_columns = ["REMARKS", "SIGNIF_EVENTS_SUMMARY", "MAJOR_PROBLEMS"]
ICS_data.load(file_name, id_col=document_id_col, text_columns=text_columns, load_kwargs={'dtype':str})
ICS_data.prepare_data(create_ids=True, combine_columns=text_columns, remove_incomplete_rows=True)
ICS_data.sentence_tokenization()
Removing Incomplete Rows…: 100%|██████████| 120804/120804 [01:22<00:00, 1469.66it/s]
Creating Unique IDs…: 100%|██████████| 37350/37350 [00:01<00:00, 20735.39it/s]
data preparation: 1.42 minutes
Sentence Tokenization…: 100%|██████████| 37350/37350 [01:32<00:00, 404.89it/s]
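A likely reason sentence tokenization needs raw text is that sentence boundaries depend on punctuation and original word order, which word-level preprocessing removes. The sketch below uses a naive regex splitter to show the difference; it is a stand-in for illustration and is not the tokenizer used by `sentence_tokenization()`.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

raw = "Fire crossed the highway. Evacuations remain in effect! Crews need more engines."
raw_sentences = split_sentences(raw)
print(raw_sentences)

# After word-level preprocessing the punctuation is gone, so nothing can be split.
preprocessed = ["fire", "cross", "highway", "evacuation", "remain", "effect"]
flat_sentences = split_sentences(" ".join(preprocessed))
print(flat_sentences)
```

The raw text yields three sentences, while the preprocessed tokens come back as one undivided string.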
Save results
After preprocessing your dataset, make sure to save it.
[11]:
ICS_data.save()
Loading preprocessed data
Previously preprocessed MIKA Data objects can be reloaded by passing preprocessed=True to load:
[16]:
file = os.path.join('..','..','data','ICS','ICS_filtered_preprocessed_combined_data.csv')
ICS_data = Data()
ICS_data.load(file, preprocessed=True, id_col=document_id_col, text_columns=["Combined Text"], name='ICS', load_kwargs={'dtype':str})
[15]:
ICS_data.data_df
[15]:
CY | DISCOVERY_DATE | INCIDENT_ID | PCT_CONTAINED_COMPLETED | START_YEAR | TOTAL_AERIAL | TOTAL_PERSONNEL | REPORT_DOY | DISCOVERY_DOY | Combined Text | ... | EST_IM_COST_TO_DATE | STR_DAMAGED | STR_DESTROYED | NEW_ACRES | POO_STATE | POO_LATITUDE | POO_LONGITUDE | WEATHER_CONCERNS_NARR | INC_MGMT_ORG_ABBREV | EVACUATION_IN_PROGRESS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2010 | 2010-07-15 15:00:00 | 2000_CA-RRU-062485_VALLEY COMPLEX | 80.0 | 2010.0 | 5.000000 | 230.000000 | 197 | 196 | [resource, share, cactus] | ... | 10000.0 | 0.0 | 0.0 | 70.0 | CA | 33.619444 | -116.9 | Current Weather RH: 45 TEMP: 75 WS: 10 WD: VAR... | 4 | False |
1 | 2010 | 2010-07-15 15:00:00 | 2000_CA-RRU-062485_VALLEY COMPLEX | 60.0 | 2010.0 | 5.000000 | 230.000000 | 197 | 196 | [resource, share, incident, cactus, incident, ... | ... | 90000.0 | 0.0 | 0.0 | 503.0 | CA | 33.666667 | -117.0 | Current Weather RH: 52 TEMP: 78 WS: 12 WD: Var... | 4 | False |
2 | 2010 | 2010-07-15 15:00:00 | 2000_CA-RRU-062485_VALLEY COMPLEX | 30.0 | 2010.0 | 4.000000 | 165.000000 | 197 | 196 | [resource, share, cactus, erratic, wind, due, ... | ... | 45000.0 | 0.0 | 0.0 | 450.0 | CA | 33.65 | -116.9 | Current Weather RH: 52 TEMP: 78 WS: 12 WD: VAR... | 4 | False |
3 | 2010 | 2010-07-15 15:00:00 | 2000_CA-RRU-062485_VALLEY COMPLEX | 100.0 | 2010.0 | 4.333333 | 192.333333 | 197 | 196 | [resource, share, cactus, cactus, become, vall... | ... | 10000.0 | 0.0 | 0.0 | 10.0 | CA | 33.619444 | -116.9 | Current Weather RH: 21 TEMP: 102 WS: 12 WD: SW... | 4 | False |
4 | 2010 | 2010-07-15 15:00:00 | 2000_CA-RRU-062485_VALLEY COMPLEX | 60.0 | 2010.0 | 4.333333 | 192.333333 | 197 | 196 | [resource, share, cactus, cactus, become, vall... | ... | 50000.0 | 0.0 | 0.0 | 0.0 | CA | 33.666667 | -117.0 | Current Weather RH: 22 TEMP: 104 WS: 15 WD: SW | 4 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
44358 | 2014 | 2014-03-15 14:30:00 | 2014_VAVAS1403037_BEAVER LODGE RD. | 100.0 | 2014.0 | 0.000000 | 13.000000 | 74 | 74 | [fast, spread, field] | ... | 450.0 | 1.0 | 1.0 | 0.0 | VA | 38.521389 | -77.576111 | |||
44359 | 2014 | 2014-03-19 14:00:00 | 2014_VAVAS1406037_AIRPORT MOUNTAIN | 85.0 | 2014.0 | 0.000000 | 18.500000 | 80 | 78 | [heavy, plume, primary, carrier] | ... | 4000.0 | 0.0 | 0.0 | 85.0 | VA | 37.243056 | -82.103333 | |||
44360 | 2014 | 2014-08-20 13:00:00 | 2014_WA-WFS-513_SAND RIDGE | 0.0 | 2014.0 | 1.000000 | 95.000000 | 234 | 232 | [heavy, canyon, river, mainly, canyon, come, e... | ... | 50000.0 | 0.0 | 0.0 | 1900.0 | WA | 46.000395 | -120.024213 | imt has spot weather forecast for next 24 hour... | ||
44361 | 2014 | 2014-08-20 13:00:00 | 2014_WA-WFS-513_SAND RIDGE | 86.0 | 2014.0 | 1.000000 | 120.000000 | 235 | 232 | [laid, night, test, wind, remain, canyon, peri... | ... | 185000.0 | 0.0 | 0.0 | 0.0 | WA | 46.000395 | -120.024213 | imt has spot weather forecast for next 24 hour... | ||
44362 | 2014 | 2014-08-20 13:00:00 | 2014_WA-WFS-513_SAND RIDGE | 100.0 | 2014.0 | 0.000000 | 46.000000 | 235 | 232 | [report, incident, wind, test, overnight] | ... | 200000.0 | 0.0 | 0.0 | 0.0 | WA | 46.000395 | -120.024213 |
44363 rows × 25 columns
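The Combined Text column above holds token lists rather than plain strings. One reason a loader may need to know that a file is preprocessed is that lists written to CSV come back as strings and must be parsed again. The sketch below demonstrates that round trip with pandas and `ast.literal_eval`; whether MIKA stores and restores token lists this way is an assumption.

```python
import ast
import io
import pandas as pd

# Toy preprocessed data: each document is a list of tokens.
df = pd.DataFrame({
    "Combined Text": [["resource", "share", "cactus"], ["fast", "spread", "field"]],
})

# Write to an in-memory CSV and read it back as text.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, dtype=str)

# The lists are now strings such as "['resource', 'share', 'cactus']".
first_raw = reloaded["Combined Text"].iloc[0]
print(type(first_raw).__name__, first_raw)

# Parse each string back into a Python list.
reloaded["Combined Text"] = reloaded["Combined Text"].apply(ast.literal_eval)
print(reloaded["Combined Text"].iloc[0])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for restoring lists from a file.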