Topic Model Plus Example
This example walks through MIKA's Topic_Model_plus class and includes steps for performing:
LDA topic modeling
hLDA topic modeling
BERTopic modeling
First, we load the data (here, the ICS-209-PLUS data set), filter it, and then set up the topic models.
[1]:
import sys
import os
sys.path.append(os.path.join("..",".."))
from mika.kd import Topic_Model_plus
from mika.utils import Data
from mika.utils.stopwords.ICS_stop_words import stop_words
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
import pandas as pd
ICS_stop_words = stop_words
Data Preparation for BERTopic
[2]:
# load data for BERTopic
ICS_data = Data()
file_name = os.path.join('..','..','data','ICS','ics209-plus-wf_sitreps_1999to2014.csv')
text_columns = ["REMARKS", "SIGNIF_EVENTS_SUMMARY", "MAJOR_PROBLEMS"]
document_id_col = "INCIDENT_ID"
ICS_data.load(file_name, id_col=document_id_col, text_columns=text_columns, load_kwargs={'dtype':str})
ICS_data.prepare_data(create_ids=True, combine_columns=text_columns, remove_incomplete_rows=True)
ICS_data.text_columns = ['Combined Text']
save_words = ['jurisdictions', 'team', 'command', 'organization', 'type', 'involved', 'transition', 'transfer', 'impact', 'concern', 'site', 'nation', 'political', 'social', 'adjacent', 'community', 'cultural', 'tribal', 'monument', 'archeaological', 'highway', 'traffic', 'road', 'travel', 'interstate', 'closure', 'remain', 'remains', 'close', 'block', 'continue', 'impact', 'access', 'limit', 'limited', 'terrain', 'rollout', 'snag', 'steep', 'debris', 'access', 'terrian', 'concern', 'hazardous', 'pose', 'heavy', 'rugged', 'difficult', 'steep', 'narrow', 'violation', 'notification', 'respond', 'law', 'patrol', 'cattle', 'buffalo', 'grow', 'allotment', 'ranch', 'sheep', 'livestock', 'grazing', 'pasture', 'threaten', 'concern', 'risk', 'threat', 'evacuation', 'evacuate', ' threaten', 'threat', 'resident', ' residence', 'level', 'notice', 'community', 'structure', 'subdivision', 'mandatory', 'order', 'effect', 'remain', 'continue', 'issued', 'issue', 'injury', 'hospital', 'injured', 'accident', 'treatment', 'laceration', 'firefighter', 'treated', 'minor', 'report', 'transport', 'heat', 'shoulder', 'ankle', 'medical', 'released', 'military', 'unexploded', 'national', 'training', 'present', 'ordinance', 'guard', 'infrastructure', 'utility', 'powerline', 'water', 'electric', 'pipeline', 'powerlines', 'watershed', 'pole', 'power', 'gas', 'concern', 'near', 'hazard', 'critical', 'threaten', 'threat', 'off', 'weather', 'behavior', 'wind', 'thunderstorm', 'storm', 'gusty', 'lightning', 'flag', 'unpredictable', 'extreme', 'erratic', 'strong', 'red', 'warning', 'species', 'specie', 'habitat', 'animal', 'plant', 'conservation', 'threaten', 'endanger', 'threat', 'sensitive', 'threatened', 'endangered', 'risk', 'loss', 'impacts', 'unstaffed', 'resources', 'support', 'crew', 'aircraft', 'helicopter', 'engines', 'staffing', 'staff', 'lack', 'need', 'shortage', 'minimal', 'share', 'necessary', 'limited', 'limit', 'fatigue', 'flood', 'flashflood', 'flash', 'risk', 'potential', 
'mapping', 'map', 'reflect', 'accurate', 'adjustment', 'change', 'reflect', 'aircraft', 'heli', 'helicopter', 'aerial', 'tanker', 'copter', 'grounded', 'ground', 'suspended', 'suspend', 'smoke', 'impact', 'hazard', 'windy', 'humidity', 'moisture', 'hot', 'drought', 'low', 'dry', 'prolonged']
# filter data
file = os.path.join('..','..','data','ICS','summary_reports_cleaned.csv')
filtered_df = pd.read_csv(file, dtype=str)
filtered_ids = filtered_df['INCIDENT_ID'].unique()
ICS_data.data_df = ICS_data.data_df.loc[ICS_data.data_df['INCIDENT_ID'].isin(filtered_ids)].reset_index(drop=True)
ICS_data.doc_ids = ICS_data.data_df['Unique IDs'].tolist()
# save raw text for bertopic
raw_text = ICS_data.data_df[ICS_data.text_columns]
ICS_data.sentence_tokenization()
Removing Incomplete Rows…: 100%|██████████| 120804/120804 [01:22<00:00, 1462.63it/s]
Creating Unique IDs…: 100%|██████████| 37350/37350 [00:01<00:00, 20368.50it/s]
data preparation: 1.43 minutes
Sentence Tokenization…: 100%|██████████| 26397/26397 [00:37<00:00, 703.36it/s]
Data Preparation for LDA and hLDA
[3]:
# load data for LDA/hLDA
file = os.path.join('..','..','data','ICS','ICS_filtered_preprocessed_combined_data.csv')
ICS_data_processed = Data()
ICS_data_processed.load(file, preprocessed=True, id_col='Unique IDs', text_columns=['Combined Text'], name='ICS')
Initialize topic model plus object
[4]:
ICS_tm = Topic_Model_plus(text_columns=['Combined Text Sentences'], data=ICS_data)
BERTopic
To use BERTopic in MIKA, you can define:
a vectorizer model, which creates n-grams while excluding stopwords
seed topics
One key difference between MIKA and the base BERTopic is that MIKA has a from_probs argument that allows users to assign topics to documents based on a probability threshold, whereas traditional BERTopic only assigns one topic to each document.
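The idea behind threshold-based assignment can be sketched in plain Python. This is not MIKA's implementation: the helper name, the threshold value, the probabilities, and the choice to return -1 (BERTopic's outlier label) when no topic qualifies are all assumptions made for illustration.

```python
def assign_topics_from_probs(doc_topic_probs, threshold=0.1):
    """Return, per document, every topic index whose probability meets the threshold.

    Hypothetical helper for illustration; not part of MIKA or BERTopic.
    """
    assignments = []
    for probs in doc_topic_probs:
        topics = [topic for topic, p in enumerate(probs) if p >= threshold]
        # Assumption: fall back to -1 (outlier) when no topic qualifies.
        assignments.append(topics if topics else [-1])
    return assignments


# Toy document-topic probabilities: 3 documents, 3 topics (made-up values).
doc_topic_probs = [
    [0.70, 0.20, 0.05],
    [0.05, 0.45, 0.40],
    [0.02, 0.03, 0.04],
]

multi_topic = assign_topics_from_probs(doc_topic_probs, threshold=0.1)
# Single-topic assignment for comparison: the most probable topic per document.
single_topic = [max(range(len(p)), key=p.__getitem__) for p in doc_topic_probs]

print(multi_topic)
print(single_topic)
```

With a threshold, the second document is linked to both of its plausible topics, whereas the single-topic rule keeps only the most probable one.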
[5]:
from nltk.corpus import stopwords
total_stopwords = stopwords.words('english')+ICS_stop_words
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=total_stopwords) #removes stopwords
seed_topic_list = [['highway', 'traffic', 'road', 'travel', 'interstate', 'closure', 'remain', 'remains', 'close', 'block', 'impact', 'access', 'limit', 'limited'],
['transition', 'transfer'],
['evacuation', 'evacuate',],
['mapping', 'map', 'reflect', 'accurate', 'adjustment', 'change', 'reflect', 'inaccurate'],
['aerial','inversion', 'suspend', 'suspension', 'prohibit', 'delay', 'hamper', 'unable', 'cancel', 'inability', 'loss', 'curtail', 'challenge', 'smoke'],
['unstaffed', 'resource', 'lack', 'need', 'shortage', 'minimal', 'share', 'necessary', 'limited', 'limit', 'fatigue'],
['injury', 'hospital', 'injured', 'accident', 'treatment', 'laceration', 'firefighter', 'treat', 'minor', 'report', 'transport', 'heat', 'shoulder', 'ankle', 'medical', 'release'],
['cultural', 'tribal', 'monument', 'archaeological', 'heritage', 'site', 'nation', 'political', 'social', 'adjacent', 'community'],
['cattle', 'buffalo', 'allotment', 'ranch', 'sheep', 'livestock', 'grazing', 'pasture', 'threaten', 'concern', 'risk', 'threat', 'private', 'area', 'evacuate', 'evacuation', 'order'],
['violation', 'arson', 'notification', 'respond', 'law'],
['military', 'unexploded', 'training', 'present', 'ordinance', 'proximity', 'activity', 'active', 'base', 'area'],
['infrastructure', 'utility', 'powerline', 'water', 'electric', 'pipeline', 'powerlines', 'watershed', 'pole', 'power', 'gas'],
['weather', 'behavior', 'wind', 'thunderstorm', 'storm', 'gusty', 'lightning', 'flag', 'unpredictable', 'extreme', 'erratic', 'strong', 'red', 'warning', 'warn'],
['species', 'habitat', 'animal', 'plant', 'conservation', 'threaten', 'endanger', 'threat', 'sensitive', 'risk', 'loss', 'impact'],
['terrain', 'rollout', 'snag', 'steep', 'debris', 'access', 'concern', 'hazardous', 'pose', 'heavy', 'rugged', 'difficult', 'steep', 'narrow'],
['humidity', 'moisture', 'hot', 'drought', 'low', 'dry', 'prolong']]
[7]:
BERTkwargs = {"seed_topic_list": seed_topic_list,
              "top_n_words": 20,
              "min_topic_size": 150}
ICS_tm.bert_topic(count_vectorizor=vectorizer_model, BERTkwargs=BERTkwargs, from_probs=True)
ICS_tm.save_bert_results(from_probs=True) #warning: saving in excel can result in missing data when char limit is reached
ICS_tm.save_bert_topics_from_probs()
#get coherence
ICS_tm.save_bert_coherence(coh_method='c_v')
ICS_tm.save_bert_coherence(coh_method='c_npmi')
ICS_tm.save_bert_vis()
ICS_tm.save_bert_model()
2023-09-19 09:11:58,268 - BERTopic - Transformed documents to Embeddings
2023-09-19 09:23:08,908 - BERTopic - Reduced dimensionality
2023-09-19 09:26:58,891 - BERTopic - Clustered reduced embeddings
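The coh_method name 'c_npmi' suggests a coherence score built on normalized pointwise mutual information (NPMI). The sketch below computes NPMI for a single word pair from document-level co-occurrence in a made-up corpus. It is a simplification for intuition only: how MIKA computes coherence internally is not shown in this example, and library implementations typically use sliding windows and average over the top words of each topic.

```python
import math


def npmi(word_a, word_b, documents):
    """NPMI of two words from document-level co-occurrence (illustrative only)."""
    n_docs = len(documents)
    doc_sets = [set(doc) for doc in documents]
    p_a = sum(word_a in d for d in doc_sets) / n_docs
    p_b = sum(word_b in d for d in doc_sets) / n_docs
    p_ab = sum(word_a in d and word_b in d for d in doc_sets) / n_docs
    if p_ab == 0:
        return -1.0  # the words never co-occur
    if p_ab == 1:
        return 1.0  # both words appear in every document
    # PMI divided by -log p(a, b) bounds the score to [-1, 1].
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


# Toy tokenized corpus (made up for illustration).
documents = [
    ["road", "closure"],
    ["road", "closure", "smoke"],
    ["smoke", "wind"],
    ["wind", "lightning"],
]

print(npmi("road", "closure", documents))  # always appear together
print(npmi("road", "smoke", documents))    # co-occur as often as chance predicts
print(npmi("road", "wind", documents))     # never appear together
```

Scores near 1 mean two words almost always appear together, scores near 0 mean they co-occur about as often as chance predicts, and -1 means they never co-occur.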
LDA Topic Modeling
LDA topic modeling in MIKA is a wrapper around tomotopy (https://bab2min.github.io/tomotopy/v/en/) and requires preprocessed text. Users must specify the number of topics; all other arguments are optional.
[5]:
ICS_tm = Topic_Model_plus(text_columns=['Combined Text'], data=ICS_data_processed)
[5]:
text_columns = ["Combined Text"]
num_topics = {attr:50 for attr in text_columns}
ICS_tm.lda(min_cf=1, num_topics=num_topics, min_df=1, alpha=1, eta=0.0001)
ICS_tm.save_lda_results()
ICS_tm.save_lda_models()
for attr in text_columns:
    ICS_tm.lda_visual(attr)
Combined Text LDA…: 100%|██████████| 100/100 [01:40<00:00, 1.01s/it]
LDA: 1.7821103811264039 minutes
hLDA Topic Modeling
hLDA topic modeling in MIKA is also a wrapper around tomotopy (https://bab2min.github.io/tomotopy/v/en/) and requires preprocessed text. Users must specify the number of levels in the hierarchical model; all other arguments are optional.
[6]:
ICS_tm.hlda(levels=3, eta=0.50, min_cf=1, min_df=1)
ICS_tm.save_hlda_models()
ICS_tm.save_hlda_results()
Combined Text hLDA…: 100%|██████████| 100/100 [1:32:36<00:00, 55.57s/it]
hLDA: 92.72793128490449 minutes
Loading existing models
BERTopic
Previously trained models can be loaded back into topic model plus for new inference or further training, using BERTopic's load feature.
[5]:
model_path = os.path.join(r"C:\Users\srandrad\smart_nlp\examples\KD\topic_model_resultsSep-19-2023") # replace with the path to your own results folder
ICS_tm = Topic_Model_plus(text_columns=['Combined Text Sentences'], data=ICS_data)
ICS_tm.load_bert_model(model_path, reduced=False, from_probs=True)
ICS_tm.save_bert_results()
hLDA/LDA
Similarly, previously trained hLDA and LDA models can be loaded back into topic model plus.
[6]:
text_columns = ["Combined Text"]
ICS_tm = Topic_Model_plus(text_columns=text_columns, data=ICS_data)
ICS_tm.combine_cols = True
filepath = os.path.join("topic_model_results")
ICS_tm.hlda_extract_models(filepath)
ICS_tm.save_hlda_results()