kd package
mika.kd.FMEA
- class mika.kd.FMEA
A class for generating FMEAs from a dataset of reports.
- build_fmea(severity_func, group_by, year_col, group_by_kwargs={}, post_process_kwargs={}, save=True)
Builds the FMEA using the above functions, all in one call. Less customizable, but useful for a quick implementation.
- Parameters:
severity_func (function) – User-defined function for calculating severity, usually a linear combination of other values.
group_by (string) – Method used to group the FMEA rows, either from a manual file or by metadata.
year_col (string) – The column the year for the report is stored in.
group_by_kwargs (dict, optional) – dictionary containing all inputs for the group_by function. The default is {}.
post_process_kwargs (dict, optional) – dictionary containing all inputs for the post_process_fmea function. The default is {}.
- Return type:
None.
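Example (a minimal quick-start sketch; the no-argument FMEA() constructor usage, the group_by value "meta", and the column names are assumptions for illustration):

    from mika.kd import FMEA

    def severity_func(row):
        # Hypothetical severity function: a linear combination of
        # severity-related columns assumed to exist in the dataset.
        return 2 * row["Injuries"] + row["Damage"]

    fmea = FMEA()
    fmea.load_data(text_col="Narrative", id_col="Tracking #", filepath="reports.csv")
    fmea.load_model()  # defaults to NASA-AIML/MIKA_BERT_FMEA_NER on CPU
    fmea.predict()
    fmea.get_entities_per_doc()
    fmea.build_fmea(
        severity_func=severity_func,
        group_by="meta",  # assumed value selecting metadata-based grouping
        year_col="Year",
        group_by_kwargs={"grouping_col": "Mission Type"},
    )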
- calc_frequency(year_col)
Calculates the frequency for each row and assigns it a category.
- Parameters:
year_col (string) – The column the year for the report is stored in.
- Returns:
grouped_df – Grouped dataframe with frequency column added
- Return type:
DataFrame
- calc_risk()
Calculates risk as the product of severity and frequency. Adds risk column to the grouped df.
- Return type:
None.
- calc_severity(severity_func, from_file=False, file_name='', file_kwargs={})
Calculates the severity for each row according to a defined severity function.
- Parameters:
severity_func (function) – User defined function for calculating severity. Usually a linear combination of other values.
from_file (Boolean, optional) – True if the severity value is already stored in a file, False if calculated from a severity function. The default is False.
file_name (string, optional) – filepath to a spreadsheet containing the severity value for each document. The default is ‘’.
file_kwargs (dict, optional) – any kwargs needed to read the file. Typically needed for .xlsx workbooks with multiple sheets. The default is {}.
- Return type:
None.
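Example (a hedged sketch of the two usage modes; the column names, weights, file name, and the None placeholder for severity_func in the from_file case are assumptions):

    # Mode 1: compute severity with a user-defined function.
    def severity_func(row):
        return 0.6 * row["Injury Level"] + 0.4 * row["Damage Level"]

    fmea.calc_severity(severity_func)

    # Mode 2: read pre-computed severities from a spreadsheet;
    # file_kwargs is passed through to the file reader.
    fmea.calc_severity(None, from_file=True, file_name="severities.xlsx",
                       file_kwargs={"sheet_name": "Severity"})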
- display_doc(doc_id, save=True, output_path='', colors_path=None, pred=True)
Displays an annotated document with entities highlighted accordingly.
- Parameters:
doc_id (string) – The id of the document to be displayed.
save (Boolean, optional) – Saves as html if true. Displays if False. The default is True.
output_path (string, optional) – The filepath the display will be saved to. The default is “”.
colors_path (string, optional) – The path to a file that defines the colors to be used for each entity. The default is None.
pred (Boolean, optional) – True if the displayed document is from predictions, False if from manual annotations. The default is True.
- Returns:
html – The rendered HTML of the annotated document.
- Return type:
HTML string
- evaluate_preds(cm=True, class_report=True)
Can only be used if the input data is labeled. Evaluates the performance of the NER model against labeled data.
- Parameters:
cm (Boolean, optional) – Creates a confusion matrix if True. The default is True.
class_report (Boolean, optional) – Creates a classification report if True. The default is True.
- Returns:
return_vals – Dict containing confusion matrix and classification report if specified.
- Return type:
Dictionary
- get_entities_per_doc(pred=True)
Gets all entities for each document. Note that this is required because the NER model is run on sentences. This function reconstructs the documents from the sentences, while preserving the entities.
- Parameters:
pred (Boolean, optional) – True if the entities per doc are from predicted entities. False if the entities per doc are from labels. The default is True.
- Returns:
pandas DataFrame with documents as rows, entities as columns
- Return type:
data_df
- get_year_per_doc(year_col, config='/')
Used to convert dates to years prior to calculating frequency.
- Parameters:
year_col (string) – The column in the raw dataframe with the date information.
config (string, optional) – The delimiter used to split the date when extracting the year. The default is ‘/’.
- Return type:
None.
- group_docs_manual(filename, grouping_col, additional_cols=[], sample=1)
Creates FMEA rows by grouping together documents according to values manually defined in a separate file. Loads in the file and then aggregates the data. Sample IDs for documents in each row are created as well.
- Parameters:
filename (string) – filepath to the spreadsheet defining the rows.
grouping_col (string) – The column within the spreadsheet that defines the rows.
additional_cols (list, optional) – Additional columns to include in the FMEA. The default is [].
sample (int, optional) – Number of samples to pull for each FMEA row. The default is 1.
- Returns:
grouped_df – The grouped FMEA dataframe
- Return type:
DataFrame
- group_docs_with_meta(grouping_col, additional_cols=[], sample=1)
Groups documents into an FMEA using a grouping column, which is metadata from the initial dataset.
- Parameters:
grouping_col (string) – The column in the original dataset used to group documents into FMEA rows.
additional_cols (list of strings, optional) – additional columns in a dataset to include in the FMEA. The default is [].
sample (int, optional) – Number of samples to pull for each FMEA row. The default is 1.
- Returns:
grouped_df – The grouped FMEA dataframe
- Return type:
DataFrame
- load_data(text_col, id_col, filepath='', df=None, formatted=False, label_col='labels')
Loads data to prepare for FMEA extraction. Sentence tokenization is performed for preprocessing, and the raw data is also saved. Accepts a filepath to a .jsonl file (annotations from doccano) or a .csv file, a pandas DataFrame already loaded in, or the location of a huggingface dataset object. Saves the data formatted for input into the NER model. Requires the spacy en_core_web_trf model to be downloaded.
- Parameters:
text_col (string) – The column where the text used for FMEA extraction is stored.
id_col (string) – The id column in the dataframe.
filepath (string, optional) – Can input a filepath for a .jsonl (annotations from doccano) or .csv file. The default is ‘’.
df (pandas DataFrame, optional) – Can instead input a pandas DataFrame already loaded in, with one column of text. The default is None.
formatted (Bool, optional) – True if the input in filepath is a formatted dataset object. The default is False.
label_col (string, optional) – The column containing annotation labels if the data is annotated. The default is “labels”.
- Return type:
None.
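Example (a sketch of two of the input modes; file names and column names are hypothetical):

    import pandas as pd

    # From a doccano .jsonl annotation file:
    fmea.load_data(text_col="text", id_col="id", filepath="annotations.jsonl")

    # From a DataFrame already loaded in memory:
    df = pd.read_csv("reports.csv")
    fmea.load_data(text_col="Narrative", id_col="Tracking #", df=df)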
- load_model(model_checkpoint='NASA-AIML/MIKA_BERT_FMEA_NER', device=-1)
Loads in a fine-tuned custom NER model trained to extract FMEA entities. If no checkpoint is passed, the custom model from MIKA is used.
- Parameters:
model_checkpoint (string, optional) – Model checkpoint; can be from huggingface or a path on a personal device. The default is ‘NASA-AIML/MIKA_BERT_FMEA_NER’.
device (int, optional) – Device to run the model on. The default is -1, which runs on CPU.
- Return type:
None.
- post_process_fmea(id_name='ID', phase_name='Mission Type', max_words=20)
Post processes the FMEA to identify the column that contains the phase name, clean sub-word tokens, and limit the number of words per cell.
- Parameters:
id_name (string, optional) – Name of the dataset used / name of the id column. The default is ‘ID’.
phase_name (string, optional) – Column that can be used to find the phase of operation. The default is ‘Mission Type’.
max_words (int, optional) – Maximum number of words in a cell in the FMEA. The default is 20.
- Returns:
fmea_df – FMEA post processed DataFrame
- Return type:
DataFrame
- predict()
Performs named entity recognition on the input data.
- Returns:
preds – Predicted entities for each document
- Return type:
Preds
mika.kd.Topic_Model_plus
- class mika.kd.Topic_Model_plus(text_columns=[], data=None, ngrams=None, results_path='')
Topic model plus
A class for topic modeling for aviation safety.
- Variables:
text_columns (list) – defines various columns within a single database which will be used for topic modeling
data (Data) – Data object storing the text corpus
ngrams (str) – ‘tp’ if the user wants tomotopy to form ngrams prior to applying a topic model
doc_ids (list) – list of document ids pulled from data object
data_df (pandas dataframe) – df storing documents pulled from data object
data_name (string) – dataset name pulled from data object
id_col (string) – the column storing the document ids pulled from data object
hlda_models (dictionary) – variable for storing hlda models
lda_models (dictionary) – variable for storing lda models
folder_path (string) – destination for storing results and models
results_path (string) – destination in which to create the results folder
- bert_topic(sentence_transformer_model=None, umap=None, hdbscan=None, count_vectorizor=None, ngram_range=(1, 3), BERTkwargs={}, from_probs=True, thresh=0.01)
Train a bertopic model.
- Parameters:
sentence_transformer_model (BERT model object, optional) – BERT model object used for embeddings. The default is None.
umap (umap model object, optional) – umap model object used for dimensionality reduction. The default is None.
hdbscan (hdbscan model object, optional) – hdbscan model object for clustering. The default is None.
count_vectorizor (CountVectorizer object, optional) – count vectorizer object used for c-TF-IDF. The default is None.
ngram_range (tuple, optional) – range of ngrams to be considered. The default is (1,3).
BERTkwargs (dict, optional) – dictionary of kwargs passed into bertopic. The default is {}.
from_probs (boolean, optional) – true to assign topics to documents based on a probability threshold (i.e., documents can have multiple topics). The default is True.
thresh (float, optional) – probability threshold used when from_probs=True. The default is 0.01.
- Return type:
None.
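Example (a minimal training sketch; assumes `data` is a prepared Data object as described in the class attributes, and “Narrative” is a hypothetical text column):

    from mika.kd import Topic_Model_plus

    tm = Topic_Model_plus(text_columns=["Narrative"], data=data,
                          results_path="results")
    tm.bert_topic(ngram_range=(1, 3), from_probs=True, thresh=0.01)
    tm.save_bert_results(coherence=True, coh_method="u_mass")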
- calc_bert_coherence(docs, topics, topic_model, method='u_mass', num_words=10)
Calculates coherence for a bertopic model using gensim coherence models.
- Parameters:
docs (list) – List of document text.
topics (List) – List of topics per document.
topic_model (BERTopic model object) – Object containing the trained topic model.
method (string, optional) – Method used to calculate coherence. Can be any method used in gensim. The default is ‘u_mass’.
num_words (int, optional) – Number of words in the topic used to calculate coherence. The default is 10.
- Returns:
coherence_per_topic – List of coherence scores for each topic.
- Return type:
List
- coherence_scores(mdl, lda_or_hlda, measure='c_v')
Computes and returns coherence scores for lda and hlda models.
- Parameters:
mdl (lda or hlda model object) – topic model object created previously
lda_or_hlda (str) – denotes whether coherence is being calculated for lda or hlda
measure (string, optional) – denotes which coherence metric to compute. The default is ‘c_v’.
- Returns:
scores – coherence scores, averages, and std dev
- Return type:
dict
- get_bert_coherence(coh_method='u_mass', from_probs=False)
Gets coherence for bert models and saves it in a dictionary.
- Parameters:
coh_method (string, optional) – Method used to calculate coherence. Can be any method used in gensim. The default is ‘u_mass’.
from_probs (boolean, optional) – Whether or not to use document topic probabilities to assign topics. True to use probabilities - i.e., each document can have multiple topics. False to not use probabilities - i.e., each document only has one topic. The default is False.
- Return type:
None.
- get_bert_topic_diversity(topk=10)
Gets topic diversity scores for a BERTopic model.
- Parameters:
topk (int, optional) – Number of words per topic used to calculate diversity. The default is 10.
- Return type:
None.
- get_bert_topics_from_probs(topic_df, thresh=0.01, coherence=False)
Saves topic model results including each topic number, words, number of words, and best document when document topics are defined by a probability threshold.
- Parameters:
topic_df (dictionary of dataframes) – dictionary of dataframes where each key is a text column and each value is the corresponding topic model results
thresh (float, optional) – probability threshold used when from_probs=True. The default is 0.01.
coherence (boolean, optional) – true to calculate coherence for the model and save the results. The default is False.
- Returns:
new_topic_dfs – dictionary of dataframes where each key is a text column and each value is the corresponding topic model results
- Return type:
dictionary of dataframes
- hdp(training_iterations=1000, iteration_step=10, to_lda=True, kwargs={}, topic_threshold=0.0)
Performs HDP topic modeling which is useful when the number of topics is not known.
- Parameters:
training_iterations (int, optional) – number of training iterations. The default is 1000.
iteration_step (int, optional) – number of steps per iteration. The default is 10.
to_lda (boolean, optional) – True to convert the hdp model to an lda model. The default is True.
kwargs (dict, optional) – kwargs to pass into the hdp model. The default is {}.
topic_threshold (float, optional) – probability threshold used when converting hdp topics to lda topics. The default is 0.0.
- Return type:
None.
- hlda(levels=3, training_iterations=1000, iteration_step=10, **kwargs)
Performs hlda topic modeling.
- Parameters:
levels (int, optional) – number of hierarchical levels. The default is 3.
training_iterations (int, optional) – number of training iterations. The default is 1000.
iteration_step (int, optional) – number of steps per iteration. The default is 10.
**kwargs (dict) – any kwargs for the hlda topic model
- Return type:
None.
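Example (a hedged sketch continuing from a constructed Topic_Model_plus instance `tm`; the column name is hypothetical):

    tm.hlda(levels=3, training_iterations=1000, iteration_step=10)
    tm.save_hlda_results()
    # Visualize the tree, keeping 1 node at level 1 and up to 6 at level 2:
    tm.hlda_display(col="Narrative", num_words=5,
                    display_options={"level 1": 1, "level 2": 6})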
- hlda_display(col, num_words=5, display_options={'level 1': 1, 'level 2': 6}, colors='bupu', filename='')
Saves graphviz visualization of hlda tree structure.
- Parameters:
col (string) – column of interest
num_words (int, optional) – number of words per node. The default is 5.
display_options (dictionary, optional) – nested dictionary where keys are levels and values are the max number of nodes. The default is {“level 1”: 1, “level 2”: 6}.
colors (string, optional) – brewer color scheme used; the default is blue-purple (‘bupu’). See http://graphviz.org/doc/info/colors.html#brewer for options.
filename (string, optional) – can input a filename for where the topics are stored in order to create the display after hlda; must be an output from “save_hlda_topics()” or an hlda.bin object. The default is ‘’.
- Return type:
None.
- hlda_extract_models(file_path)
Gets hlda models from file.
- Parameters:
file_path (string) – path to file
- Return type:
None.
- hlda_visual(col)
Saves pyLDAvis output from hlda to file.
- Parameters:
col (str) – reference to column of interest
- Return type:
None.
- label_hlda_topics(extractor_min_cf=5, extractor_min_df=3, extractor_max_len=5, extractor_max_cand=5000, labeler_min_df=5, labeler_smoothing=0.01, labeler_mu=0.25, label_top_n=3)
Uses tomotopy’s auto topic labeling tool to label topics. Stores labels in class; after running this function, a flag can be used to use labels or not in taxonomy saving functions.
- Parameters:
extractor_min_cf (int) – from tomotopy docs: “minimum collection frequency of collocations. Collocations with a smaller collection frequency than min_cf are excluded from the candidates. Set this value large if the corpus is big”
extractor_min_df (int) – from tomotopy docs: “minimum document frequency of collocations. Collocations with a smaller document frequency than min_df are excluded from the candidates. Set this value large if the corpus is big”
extractor_max_len (int) – from tomotopy docs: “maximum length of collocations”
extractor_max_cand (int) – from tomotopy docs: “maximum number of candidates to extract”
labeler_min_df (int) – from tomotopy docs: “minimum document frequency of collocations. Collocations with a smaller document frequency than min_df are excluded from the candidates. Set this value large if the corpus is big”
labeler_smoothing (float) – from tomotopy docs: “a small value greater than 0 for Laplace smoothing”
labeler_mu (float) – from tomotopy docs: “a discriminative coefficient. Candidates with high score on a specific topic and with low score on other topics get the higher final score when this value is the larger.”
label_top_n (int) – from tomotopy docs: “the number of labels”
- label_lda_topics(extractor_min_cf=5, extractor_min_df=3, extractor_max_len=5, extractor_max_cand=5000, labeler_min_df=5, labeler_smoothing=0.01, labeler_mu=0.25, label_top_n=3)
Uses tomotopy’s auto topic labeling tool to label topics. Stores labels in class; after running this function, a flag can be used to use labels or not in taxonomy saving functions.
- Parameters:
extractor_min_cf (int) – from tomotopy docs: “minimum collection frequency of collocations. Collocations with a smaller collection frequency than min_cf are excluded from the candidates. Set this value large if the corpus is big”
extractor_min_df (int) – from tomotopy docs: “minimum document frequency of collocations. Collocations with a smaller document frequency than min_df are excluded from the candidates. Set this value large if the corpus is big”
extractor_max_len (int) – from tomotopy docs: “maximum length of collocations”
extractor_max_cand (int) – from tomotopy docs: “maximum number of candidates to extract”
labeler_min_df (int) – from tomotopy docs: “minimum document frequency of collocations. Collocations with a smaller document frequency than min_df are excluded from the candidates. Set this value large if the corpus is big”
labeler_smoothing (float) – from tomotopy docs: “a small value greater than 0 for Laplace smoothing”
labeler_mu (float) – from tomotopy docs: “a discriminative coefficient. Candidates with high score on a specific topic and with low score on other topics get the higher final score when this value is the larger.”
label_top_n (int) – from tomotopy docs: “the number of labels”
- lda(num_topics={}, training_iterations=1000, iteration_step=10, max_topics=0, **kwargs)
Performs LDA topic modeling.
- Parameters:
num_topics (dict, optional) – keys are columns in text_columns, values are the number of topics lda forms; if omitted, lda optimization is run to determine the number of topics. The default is {}.
training_iterations (int, optional) – number of training iterations. The default is 1000.
iteration_step (int, optional) – number of steps per iteration. The default is 10.
max_topics (int, optional) – maximum number of topics to consider. The default is 0.
**kwargs (dict) – any kwargs for the lda topic model.
- Return type:
None.
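Example (a hedged sketch; the column name and topic counts are hypothetical):

    # Fixed number of topics per text column:
    tm.lda(num_topics={"Narrative": 10}, training_iterations=1000)

    # Or omit num_topics to run lda optimization, bounded by max_topics:
    tm.lda(num_topics={}, max_topics=50)
    tm.save_lda_results()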
- lda_extract_models(file_path)
Loads lda models from file.
- Parameters:
file_path (str) – path to file
- Return type:
None.
- lda_visual(col)
Saves pyLDAvis output from lda to file.
- Parameters:
col (str) – reference to column of interest
- Return type:
None.
- load_bert_model(file_path, reduced=False, from_probs=False, thresh=0.01)
Loads trained bertopic model(s).
- Parameters:
file_path (string) – file path to the folder storing the model(s)
reduced (bool, optional) – True if the model is a reduced topic model, by default False
from_probs (bool, optional) – Whether or not to use document topic probabilities to assign topics. True to use probabilities - i.e., each document can have multiple topics. False to not use probabilities - i.e., each document only has one topic, by default False
thresh (float, optional) – probability threshold used when from_probs=True, by default 0.01
- reduce_bert_topics(num=30, from_probs=False, thresh=0.01)
Reduces the number of topics in a trained bertopic model to the specified number.
- Parameters:
num (int, optional) – number of topics in the reduced model. The default is 30.
from_probs (boolean, optional) – true to assign topics to documents based on a probability threshold (i.e., documents can have multiple topics). The default is False.
thresh (float, optional) – probability threshold used when from_probs=True. The default is 0.01.
- Return type:
None.
- save_bert_coherence(return_df=False, coh_method='u_mass', from_probs=False)
Saves the coherence scores for a bertopic model.
- Parameters:
return_df (boolean, optional) – True to return the coherence_df. The default is False.
coh_method (string, optional) – Method used to calculate coherence. Can be any method used in gensim. The default is ‘u_mass’.
from_probs (boolean, optional) – Whether or not to use document topic probabilities to assign topics. True to use probabilities - i.e., each document can have multiple topics. False to not use probabilities - i.e., each document only has one topic. The default is False.
- Returns:
coherence_df – Dataframe with a row for each topic and a column of coherence scores.
- Return type:
pandas DataFrame
- save_bert_document_topic_distribution(return_df=False)
Saves the document topic distribution.
- Parameters:
return_df (boolean, optional) – True to return the results df. The default is False.
- Returns:
doc_df – dataframe with a row for each document and the probability for each topic
- Return type:
pandas DataFrame
- save_bert_model(embedding_model=True)
Saves a BERTopic model.
- Parameters:
embedding_model (boolean, optional) – True to save the embedding model. The default is True.
- Return type:
None.
- save_bert_results(coherence=False, coh_method='u_mass', from_probs=False, thresh=0.01, topk=10)
Saves the taxonomy, coherence, and document topic distribution in one Excel file.
- Parameters:
coherence (boolean, optional) – true to calculate coherence for the model and save the results. The default is False.
coh_method (string, optional) – Method used to calculate coherence. Can be any method used in gensim. The default is ‘u_mass’.
from_probs (boolean, optional) – true to assign topics to documents based on a probability threshold (i.e., documents can have multiple topics). The default is False.
thresh (float, optional) – probability threshold used when from_probs=True. The default is 0.01.
topk (int, optional) – Number of words per topic used to calculate diversity. The default is 10.
- Return type:
None.
- save_bert_taxonomy(return_df=False, p_thres=0.0001)
Saves a taxonomy of topics from bertopic model.
- Parameters:
return_df (boolean, optional) – True to return the results dfs. The default is False.
p_thres (float, optional) – word-topic probability threshold required for a word to be considered in a topic. The default is 0.0001.
- Returns:
taxonomy_df – taxonomy dataframe with a column for each text column and each row a unique combination of topics found in the documents
- Return type:
pandas Dataframe
- save_bert_topic_diversity(topk=10, return_df=False)
Saves topic diversity score for a bertopic model.
- Parameters:
topk (int, optional) – Number of words per topic used to calculate diversity. The default is 10.
return_df (boolean, optional) – True to return the diversity_df. The default is False.
- Returns:
diversity_df – Dataframe with topic diversity score.
- Return type:
pandas DataFrame
- save_bert_topics(return_df=False, p_thres=0.0001, coherence=False, coh_method='u_mass', from_probs=False)
Saves bert topics results to file.
- Parameters:
return_df (boolean, optional) – True to return the results dfs. The default is False.
p_thres (float, optional) – word-topic probability threshold required for a word to be considered in a topic. The default is 0.0001.
coherence (boolean, optional) – true to calculate coherence for the model and save the results. The default is False.
coh_method (string, optional) – Method used to calculate coherence. Can be any method used in gensim. The default is ‘u_mass’.
from_probs (boolean, optional) – true to assign topics to documents based on a probability threshold (i.e., documents can have multiple topics). The default is False.
- Returns:
dfs – dictionary of dataframes where each key is a text column and each value is the corresponding topic model results
- Return type:
dictionary of dataframes
- save_bert_topics_from_probs(thresh=0.01, return_df=False, coherence=False, coh_method='u_mass', from_probs=True)
Saves bertopic model results if using probability threshold.
- Parameters:
thresh (float, optional) – probability threshold used when from_probs=True. The default is 0.01.
return_df (boolean, optional) – True to return the results dfs. The default is False.
coherence (boolean, optional) – true to calculate coherence for the model and save the results. The default is False.
coh_method (string, optional) – Method used to calculate coherence. Can be any method used in gensim. The default is ‘u_mass’.
from_probs (boolean, optional) – true to assign topics to documents based on a probability threshold (i.e., documents can have multiple topics). The default is True.
- Returns:
topic_prob_dfs – dictionary of dataframes where each key is a text column and each value is the corresponding topic model results
- Return type:
dictionary of dataframes
- save_bert_vis()
Saves the bertopic visualization and hierarchy visualization.
- Return type:
None.
- save_hlda_coherence(return_df=False)
Saves hlda coherence to file.
- Parameters:
return_df (boolean, optional) – True to return the results df. The default is False.
- Returns:
coherence_df – Dataframe with a row for each topic and a column of coherence scores.
- Return type:
pandas DataFrame
- save_hlda_document_topic_distribution(return_df=False)
Saves hlda document topic distribution to file.
- Parameters:
return_df (boolean, optional) – True to return the results df. The default is False.
- Returns:
doc_df – dataframe with a row for each document and the probability for each topic
- Return type:
pandas DataFrame
- save_hlda_level_n_taxonomy(lev=1, return_df=False)
Saves hlda taxonomy at level n.
- Parameters:
lev (int, optional) – the level number to save. The default is 1.
return_df (boolean, optional) – True to return the results df. The default is False.
- Returns:
taxonomy_level_df – Taxonomy dataframe with a column for each text column and each row a unique combination of topics found in the documents
- Return type:
pandas Dataframe
- save_hlda_models()
Saves hlda models to file.
- Return type:
None.
- save_hlda_results()
Saves the taxonomy, level 1 taxonomy, raw topics, coherence, and document topic distribution in one Excel file.
- Return type:
None.
- save_hlda_taxonomy(return_df=False, use_labels=False, num_words=10)
Saves hlda taxonomy to file.
- Parameters:
return_df (boolean, optional) – True to return the results df. The default is False.
use_labels (boolean, optional) – True to use topic labels generated from tomotopy. The default is False.
num_words (int, optional) – Number of words to display in the taxonomy. The default is 10.
- Returns:
taxonomy_df – Taxonomy dataframe with a column for each text column and each row a unique combination of topics found in the documents
- Return type:
pandas Dataframe
- save_hlda_topics(return_df=False, p_thres=0.001)
Saves hlda topics to file.
- Parameters:
return_df (boolean, optional) – True to return the results df. The default is False.
p_thres (float, optional) – word-topic probability threshold required for a word to be considered in a topic. The default is 0.001.
- Returns:
dfs – dictionary of dataframes where each key is a text column and each value is the corresponding topic model results
- Return type:
dictionary of dataframes
- save_lda_coherence(return_df=False)
Saves lda coherence to file or returns the dataframe to another function.
- Parameters:
return_df (boolean, optional) – True to return the results df. The default is False.
- Returns:
coherence_df – Dataframe with a row for each topic and a column of coherence scores.
- Return type:
pandas DataFrame
- save_lda_document_topic_distribution(return_df=False)
Saves lda document topic distribution to file or returns the dataframe to another function.
- Parameters:
return_df (boolean, optional) – True to return the results df. The default is False.
- Returns:
doc_df – dataframe with a row for each document and the probability for each topic
- Return type:
pandas DataFrame
- save_lda_models()
Saves lda models to file.
- Return type:
None.
- save_lda_results()
Saves the taxonomy, coherence, and document topic distribution in one Excel file.
- Return type:
None.
- save_lda_taxonomy(return_df=False, use_labels=False, num_words=10)
Saves lda taxonomy to file or returns the dataframe to another function.
- Parameters:
return_df (boolean, optional) – True to return the results df. The default is False.
use_labels (boolean, optional) – True to use topic labels generated from tomotopy. The default is False.
num_words (int, optional) – Number of words to display in the taxonomy. The default is 10.
- Returns:
taxonomy_df – Taxonomy dataframe with a column for each text column and each row a unique combination of topics found in the documents
- Return type:
pandas Dataframe
- save_lda_topics(return_df=False, p_thres=0.001)
Saves lda topics to file.
- Parameters:
return_df (boolean, optional) – True to return the results df. The default is False.
p_thres (float, optional) – word-topic probability threshold required for a word to be considered in a topic. The default is 0.001.
- Returns:
dfs – dictionary of dataframes where each key is a text column and each value is the corresponding topic model results
- Return type:
dictionary of dataframes
- save_mixed_taxonomy(use_labels=False)
Saves a custom mixed lda/hlda model taxonomy. lda and hlda must be run first with the desired parameters.
- Parameters:
use_labels (boolean, optional) – True to use topic labels. The default is False.
- Return type:
None.
mika.kd.NER
Description: Utility functions for training custom NER models.
- mika.kd.NER.align_labels_with_tokens(labels, word_ids)
Aligns labels and tokens for model training. Adds special tokens to identify tokens not in the tokenizer and the beginning of new words.
- Parameters:
labels (list) – List of NER labels where each label is a string
word_ids (list) – List of word ids generated from a tokenizer
- Returns:
new_labels – List of new labels with special token labels added.
- Return type:
list
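Example (an illustrative sketch; the tag names are hypothetical, and the exact labeling convention for special and subword tokens is an assumption based on common practice):

    from mika.kd.NER import align_labels_with_tokens

    labels = ["B-HAZ", "O", "O"]         # one label per word
    word_ids = [None, 0, 0, 1, 2, None]  # e.g., tokenizer(..., is_split_into_words=True).word_ids()
    new_labels = align_labels_with_tokens(labels, word_ids)
    # new_labels has one entry per token: special tokens (word_id None)
    # receive a special label, and subword continuations (repeated word
    # ids) receive a label marking them as continuations of the word.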
- mika.kd.NER.build_confusion_matrix(labels, preds, pred_labels, label_list, save=False, savepath='')
Creates a confusion matrix for the multiclass NER classification.
- Parameters:
labels (list) – true labels
preds (Numpy Array) – predictions output from model
pred_labels (List) – token labels corresponding to predictions. Used for removing special tokens
label_list (Dict) – Maps label ids to NER labels.
save (Boolean, optional) – True to save the figure as pdf, false to not save. The default is False.
savepath (string, optional) – Path to save the figure to. The default is “”.
- Returns:
conf_mat (pandas DataFrame) – confusion matrix object
true_predictions (list) – List of model prediction labels
true_labels (list) – List of true labels
- mika.kd.NER.check_doc_to_sentence_split(sentence_df)
Tests that tags are preserved during document to sentence split.
- Parameters:
sentence_df (pandas DataFrame) – Dataframe where each row is a sentence. Has columns ‘sentence’ and ‘tags’
- Return type:
None.
- mika.kd.NER.clean_annots_from_str(df)
Cleans annotated documents by removing excess symbols. Use this if the text is a string converted from a list of words.
- Parameters:
df (pandas dataframe) – Dataframe of the annotated document set.
- Returns:
df – Dataframe of the cleaned annotated document set.
- Return type:
pandas dataframe
- mika.kd.NER.clean_doccano_annots(df)
Cleans annotated documents.
- Parameters:
df (pandas dataframe) – Dataframe of the annotated document set.
- Returns:
df – Dataframe of the cleaned annotated document set.
- Return type:
pandas dataframe
- mika.kd.NER.clean_text_tags(text, labels)
Cleans an individual text and label set by removing extra spaces or punctuation at the beginning or end of a string, fixing annotations that are missing a preceding or following character, and adding spaces into text that is missing a space.
- Parameters:
text (string) – text associated with the labels
labels (list) – list of labels from annotation where each label in the list is a tuple with structure (1,2,3), where 1 is the beginning character location, 2 is the end character location, and 3 is the label name
- Returns:
new_labels (list) – cleaned list of labels
text (string) – cleaned string corresponding to the labels.
- mika.kd.NER.compute_classification_report(labels, preds, pred_labels, label_list)
Computes classification report.
- Parameters:
labels (list) – true labels
preds (Numpy Array) – predictions output from model
pred_labels (List) – token labels corresponding to predictions. Used for removing special tokens
label_list (Dict) – Maps label ids to NER labels.
- Returns:
Output of classification results
- Return type:
Classification Report
- mika.kd.NER.compute_metrics(eval_preds, id2label)
Calculates sequence classification metrics. Used during training.
- Parameters:
eval_preds (Tensor) – Pytorch tensor generated during training/evaluation.
id2label (Dict) – Dict mapping numeric ids to labels.
- Return type:
Dict of metrics.
- mika.kd.NER.get_cleaned_label(label)
Removes the BILOU component of a tag.
- Parameters:
label (string) – NER label potentially with BILOU tag.
- Returns:
label – Label with BILOU component removed.
- Return type:
string
- mika.kd.NER.identify_bad_annotations(text_df)
Used for identifying invalid annotations which typically occur from inaccurate tagging, for example, tagging an extra space or punctuation. Must be run prior to any model training or training will fail on bad annotations.
- Parameters:
text_df (pandas DataFrame) – Dataframe storing the documents. Must contain ‘docs’ column and a ‘tags’ column.
- Returns:
bad_tokens – list of tokens that are invalid.
- Return type:
list
- mika.kd.NER.plot_eval_metrics(eval_df, save, savepath)
Plots classification metrics on the evaluation set over training.
- Parameters:
eval_df (pandas DataFrame) – Dataframe of evaluation metrics
save (Boolean, optional) – True to save the figure as pdf, false to not save. The default is False.
savepath (string, optional) – Path to save the figure to. The default is “”.
- Return type:
None.
- mika.kd.NER.plot_eval_results(filepath, final_train_metrics={}, final_eval_metrics={}, save=False, savepath=None, loss=True, metrics=True)
Plots evaluation and training metrics as specified using other functions.
- Parameters:
filepath (string) – Path to the checkpoint logs.
final_train_metrics (Dict, optional) – Dictionary of final training metrics in case they are not recorded in the log. The default is {}.
final_eval_metrics (Dict, optional) – Dictionary of final eval metrics in case they are not recorded in the log. The default is {}.
save (Boolean, optional) – True to save the figure as pdf, false to not save. The default is False.
savepath (string, optional) – Path to save the figure to. The default is None.
loss (Boolean, optional) – True to plot loss False to not plot loss. The default is True.
metrics (Boolean, optional) – True to plot evaluation metrics. The default is True.
- Return type:
None.
- mika.kd.NER.plot_loss(eval_df, training_df, save, savepath)
Plots the validation and training loss over training.
- Parameters:
eval_df (pandas DataFrame) – Dataframe of evaluation metrics
training_df (pandas DataFrame) – Dataframe of training metrics
save (Boolean, optional) – True to save the figure as pdf, false to not save. The default is False.
savepath (string, optional) – Path to save the figure to. The default is “”.
- Return type:
None.
- mika.kd.NER.read_doccano_annots(file, encoding=False)
Reads in a .jsonl file containing annotations from doccano.
- Parameters:
file (string) – File location.
encoding (boolean, optional) – True if the file should be opened with utf-8 encoding. The default is False.
- Returns:
df – Dataframe of the annotated document set.
- Return type:
pandas dataframe
- mika.kd.NER.read_trainer_logs(filepath, final_train_metrics, final_eval_metrics)
Reads training logs stored at a checkpoint to get training metrics.
- Parameters:
filepath (string) – Path to the checkpoint logs.
final_train_metrics (Dict) – Dictionary of final training metrics in case they are not recorded in the log
final_eval_metrics (Dict) – Dictionary of final eval metrics in case they are not recorded in the log
- Returns:
eval_df (pandas DataFrame) – Dataframe of evaluation metrics
training_df (pandas DataFrame) – Dataframe of training metrics
- mika.kd.NER.split_docs_to_sentences(text_df, id_col='Tracking #', tags=True)
Splits a dataframe of documents into a new dataframe where each sentence from each document is in its own row. This is useful for using BERT models with maximum character limits.
- Parameters:
text_df (pandas DataFrame) – Dataframe storing the documents. Must contain ‘docs’ column and an id column.
id_col (string, optional) – The column in the dataframe storing the document ids. The default is ‘Tracking #’.
tags (boolean, optional) – True if the dataframe also contains tags or annotations. False if no annotations are present. The default is True.
- Returns:
sentence_df – New dataframe where each row is a sentence.
- Return type:
pandas DataFrame
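Example (a sketch of a typical annotation-preparation pipeline; the file name is hypothetical, and it is assumed the cleaned dataframe carries the ‘docs’ and ‘tags’ columns the later functions expect):

    from mika.kd.NER import (read_doccano_annots, clean_doccano_annots,
                             identify_bad_annotations, split_docs_to_sentences)

    df = read_doccano_annots("annotations.jsonl")
    df = clean_doccano_annots(df)
    # Catch invalid tags (e.g., stray spaces or punctuation) before training:
    bad_tokens = identify_bad_annotations(df)
    sentence_df = split_docs_to_sentences(df, id_col="Tracking #", tags=True)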
- mika.kd.NER.tokenize(sentence_df, tokenizer)
Tokenizer used during dataset creation when the input data has already been tokenized. Currently unused, can be used to get token ids.
- Parameters:
sentence_df (pandas DataFrame) – Dataframe where each row is a sentence. Has columns ‘sentence’ and ‘tags’
tokenizer (tokenizer object) – tokenizer object, usually an AutoTokenizer for a Transformers BERT model
- Returns:
tokenized_inputs – list of tokens with corresponding IDs from the tokenizer
- Return type:
list
- mika.kd.NER.tokenize_and_align_labels(sentence_df, tokenizer, align_labels=True)
Tokenizes text and aligns with labels. Necessary due to subword tokenization with BERT.
- Parameters:
sentence_df (pandas DataFrame) – Dataframe where each row is a sentence. Has columns ‘sentence’ and ‘tags’
tokenizer (tokenizer object) – tokenizer object, usually an AutoTokenizer for a Transformers BERT model
align_labels (Boolean, optional) – True if labels should be aligned. The default is True.
- Returns:
tokenized_inputs – Tokenized inputs with new labels aligned to corresponding tokens
- Return type:
Object
mika.kd.trend_analysis
Description: Trend analysis functions.
- mika.kd.trend_analysis.add_hazards_to_docs(preprocessed_df, id_field, docs)
Add hazards to documents
Add hazard values to preprocessed_df for running chi-squared tests.
- Parameters:
preprocessed_df (pandas DataFrame) – pandas dataframe containing documents.
id_field (string) – the column in preprocessed df that contains document ids
docs (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.
- Returns:
preprocessed_df – pandas dataframe containing documents.
- Return type:
pandas DataFrame
- mika.kd.trend_analysis.bootstrap_metric(metric_data, time_vals, metric_percentages, num_means=1000, CI_interval=95)
Bootstrap metric
Performs bootstrapping to better estimate the true metric value given a metric percentage (e.g., hazard extraction accuracy).
- Parameters:
metric_data (dict) – nested dict where keys are hazards. inner dict has time value (usually years) as keys and list of metrics for values.
time_vals (list) – list of time values in the time series. generated in plot metric time series.
metric_percentages (dictionary) – dict with hazards as keys and values as the percentage to be sampled for each bootstrap. This can be input as the hazard extraction accuracy to try to better capture the true population mean.
num_means (int, optional) – the number of means calculated via bootstrapping, by default 1000
CI_interval (int, optional) – level of confidence for the interval, by default 95
- Returns:
averages (dictionary) – dictionary with hazards as keys and the value is a list of metric averages with one average per year in the time series.
CI (dictionary) – dictionary with hazards as keys and value is a (2,n) array where n is the number of years of data.
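Example (a sketch of the expected input shapes; hazard names and values are hypothetical):

    metric_data = {"Engine Failure": {2019: [3.2, 4.1, 3.8],
                                      2020: [4.5, 4.0]}}
    time_vals = [2019, 2020]
    metric_percentages = {"Engine Failure": 0.9}  # e.g., extraction accuracy

    averages, CI = bootstrap_metric(metric_data, time_vals, metric_percentages,
                                    num_means=1000, CI_interval=95)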
- mika.kd.trend_analysis.build_word_clouds(word_frequencies, nrows, ncols, figsize=(8, 4), cmap=None, save=False, save_path=None, fontsize=10, wordcloud_kwargs={})
Word clouds
Builds a word cloud for each hazard.
- Parameters:
word_frequencies (dictionary) – nested dictionary where keys are hazards. inner dictionary has words as keys and word frequencies as values.
nrows (int) – number of rows in the grid of word clouds
ncols (int) – number of columns in the grid of word clouds
figsize (tuple, optional) – figure size in inches. The default is (8, 4).
cmap (matplotlib colormap, optional) – colormap object used for coloring the word clouds. The default is None.
save (boolean, optional) – true to save figure. The default is False.
save_path (string, optional) – path to save figure to. The default is None.
fontsize (int, optional) – fontsize for title and minimum fontsize in wordcloud. The default is 10.
wordcloud_kwargs (dict, optional) – optional keyword args to pass into the wordcloud object. The default is {}.
- Return type:
None
- mika.kd.trend_analysis.calc_CI(bootstrapped_means, CI_interval=95)
Calculate confidence interval
Calculates a confidence interval from a list of means generated using bootstrapping.
- Parameters:
bootstrapped_means (dictionary) – nested dictionary with hazards as keys, inner dictionary with years as keys and a list of bootstrapped means as the value.
CI_interval (int, optional) – level of confidence for the interval, by default 95
- Returns:
CI – dictionary with hazards as keys and value is a (2,n) array where n is the number of years of data.
- Return type:
dictionary
- mika.kd.trend_analysis.calc_classification_metrics(labeled_file, docs_per_hazard, id_col)
Calculate classification metrics
Calculates classification metrics where the true labels come from a file and the predicted labels are from the hazard extraction process.
- Parameters:
labeled_file (string) – location the labeled file is saved at
docs_per_hazard (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.
id_col (string) – column in preprocessed_df containing document ids
- Returns:
metrics_df (pandas DataFrame) – dataframe with recall, precision, f1, accuracy, and support for each hazard
labeled_docs (pandas DataFrame) – dataframe of the manually labeled docs
HEAT_labeled_docs (pandas DataFrame) – dataframe of the HEAT labeled docs
- mika.kd.trend_analysis.calc_rate(frequency)
Calculate rate of occurrence of hazard
Calculates the average rate of occurrence for a hazard from the frequency per year.
- Parameters:
frequency (Dict) – Nested dictionary used to store hazard frequencies. Keys are hazards and inner dict keys are years, values are ints.
- Returns:
rates – dictionary with hazard name as keys and a rate as a value
- Return type:
dict
- mika.kd.trend_analysis.calc_severity_per_hazard(docs, df, id_field, metric='average')
Calculate severity per hazard
Used to calculate the severity for each hazard occurrence, as well as the average severity.
- Parameters:
docs (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.
df (pandas DataFrame) – pandas dataframe containing documents with severity per document already calculated
id_field (string) – the column in df that contains document ids
metric (string, optional) – whether to calculate the average or maximum severity, default is ‘average’
- Returns:
severities (Dict) – nested dictionary used to store severities per hazard occurrence. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists of severities.
total_severities_hazard (Dict) – Dictionary used to store average severity per hazard. Keys are hazards and value is average severity.
- mika.kd.trend_analysis.check_for_hazard_words(h_word, text)
Check for hazard words
Checks to see if a section of text contains a hazard word.
- Parameters:
h_word (string) – hazard word
text (string) – section of text being searched for hazard word.
- Returns:
hazard_found – true if hazard word appears in text, false if not.
- Return type:
boolean
- mika.kd.trend_analysis.check_for_negation_words(negation_words, text, h_word)
Check for negation words
Checks to see if any negation words appear within 3 words of a hazard word in a specified section of text.
- Parameters:
negation_words (list) – list of negation words.
text (string) – section of text being searched for hazard and negation words.
h_word (string) – hazard word
- Returns:
hazard_found – true if the hazard word appears in the text with no negation words; false if a negation word is present.
- Return type:
boolean
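Example (a toy illustration of the two checks used together; the negation word list is hypothetical):

    from mika.kd.trend_analysis import (check_for_hazard_words,
                                        check_for_negation_words)

    negation_words = ["no", "not", "without"]
    text = "no engine failure was observed during the flight"

    if check_for_hazard_words("failure", text):
        # True only if "failure" is not negated within 3 words; here it
        # returns False because "no" appears just before the hazard word.
        hazard_present = check_for_negation_words(negation_words, text, "failure")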
- mika.kd.trend_analysis.chi_squared_tests(preprocessed_df, hazards, predictors, pred_dict={})
Chi squared tests
Performs a chi-squared test for each predictor to determine if there is a statistically significant difference in the counts of the predictor between reports with and without each hazard.
- Parameters:
preprocessed_df (pandas DataFrame) – pandas dataframe containing documents.
hazards (list) – list of hazards
predictors (list) – list of columns in the dataframe with categorical predictors of interest.
pred_dict (dict, optional) – dictionary with predictors from the predictor list as keys and names to display in the table as values. The default is {}.
- Returns:
stats_df (pandas DataFrame) – pandas dataframe containing the chi-squared statistic and p-val for each hazard-predictor pair
count_dfs (Dict) – dictionary of pandas dataframes containing the counts for each predictor and hazard
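Example (a usage sketch with hypothetical hazard and predictor names; assumes hazard columns were already added to preprocessed_df, e.g., via add_hazards_to_docs):

    stats_df, count_dfs = chi_squared_tests(
        preprocessed_df,
        hazards=["Engine Failure", "Bird Strike"],
        predictors=["Mission Type", "Region"],
        pred_dict={"Mission Type": "Mission"})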
- mika.kd.trend_analysis.corr_sig(df=None)
Correlation significance
Returns the probability matrix for a correlation matrix.
- Parameters:
df (Pandas Dataframe, optional) – df storing the data that is used to create the correlation matrix. rows are years, columns are predictors + hazard frequencies, by default None
- Returns:
p_matrix – numpy array with p_values for corresponding dataframe entries
- Return type:
numpy array
- mika.kd.trend_analysis.create_correlation_matrix(predictors_scaled, frequencies_scaled, graph=True, mask_vals=False, figsize=(6, 4), fontsize=12, save=False, results_path='', title=False)
Create correlation matrix
Creates the correlation matrix between all predictors and all hazard frequencies. All arguments are outputs from create_metrics_time_series.
- Parameters:
predictors_scaled (dict) – dictionary with keys as predictor names, values as a time series list of values scaled using minmax
frequencies_scaled (dict) – dictionary with keys as hazard names, values as times series list of frequencies scaled using minmax
graph (boolean, optional) – True to graph the data, false to not graph. The default is True.
mask_vals (boolean, optional) – True to mask values that are not significant. The default is False.
figsize (tuple, optional) – size of the plot in inches. The default is (6,4).
fontsize (int, optional) – fontsize for the plot. The default is 12.
save (boolean, optional) – True to save the graph. The default is False.
results_path (string, optional) – path to save the plot to. The default is “”.
title (Boolean, optional) – True to show a title on graph. The default is False.
- Returns:
corrMatrix (pandas DataFrame) – correlation matrix.
correlation_mat_total (pandas DataFrame) – stores the hazard and predictor values used for the correlation matrix
p_values (pandas DataFrame) – p-vals for each correlation.
- mika.kd.trend_analysis.examine_hazard_extraction_mismatches(preprocessed_df, true, pred, hazards, hazard_words_per_doc, topics_per_doc, hazard_topics_per_doc, id_col, text_col, results_path)
Examine hazard extraction mismatches
Used to examine which documents are mislabeled by HEAT. Used iteratively to refine hazard extraction.
- Parameters:
preprocessed_df (pandas DataFrame) – pandas dataframe containing documents.
true (pandas DataFrame) – dataframe of the manually labled docs
pred (pandas DataFrame) – dataframe of the HEAT labled docs
hazards (list) – list of hazards
hazard_words_per_doc (Dict) – used to store the hazard words per document. keys are hazards and values are lists with an element for each document.
topics_per_doc (dict) – dictionary with keys as document ids and values as a list of topic numbers
hazard_topics_per_doc (dict) – nested dictionary with keys as document ids. inner dictionary has hazard names as keys and values as a list of topics
id_col (string) – column in preprocessed_df containing document ids
text_col (string) – column in preprocessed_df containing text
results_path (string) – location to save the resulting datasheets to
- Returns:
dfs – dictionary with keys as hazards and values as pandas dataframes storing document mismatches
- Return type:
dict
- mika.kd.trend_analysis.get_doc_text(id_, temp_df, id_field, text_field)
Get document text
Gets the text for a document.
- Parameters:
id_ (string) – id of the specified document
temp_df (pandas DataFrame) – subset of preprocessed_df only containing documents associated with the specified hazard
id_field (string) – the column in preprocessed df that contains document ids
text_field (string) – the column in preprocessed df that stores the text. can be different from results_text_field, but is usually the same
- Returns:
text – the document text
- Return type:
string
- mika.kd.trend_analysis.get_doc_time(id_, temp_df, id_field, time_field)
Get document time
Gets the time value for a document, usually the year of the report, but it could also be a month or any other time value.
- Parameters:
id_ (string) – id of the specified document
temp_df (pandas DataFrame) – subset of preprocessed_df only containing documents associated with the specified hazard
id_field (string) – the column in preprocessed df that contains document ids
time_field (string) – the column in preprocessed df that contains document time values, such as report year
- Returns:
year – the time value for the specified document, usually a year
- Return type:
str
- mika.kd.trend_analysis.get_hazard_df(hazard_info, hazards, i)
Get hazard dataframe
Gets hazard information for a specified hazard.
- Parameters:
hazard_info (pandas DataFrame) – pandas dataframe with columns for hazard names, hazard words, and topics per hazard
hazards (list) – list of hazard names
i (int) – the index of the specified hazard in hazard list.
- Returns:
hazard_df (pandas DataFrame) – dataframe containing the information (topics, words, etc) for the specified hazard
hazard_name (string) – name of the specified hazard
- mika.kd.trend_analysis.get_hazard_doc_ids(nums, results, results_text_field, docs, doc_topic_distribution, text_field, topic_thresh, preprocessed_df, id_field)
Get hazard document IDs
Gets the document ids associated with a specified hazard.
- Parameters:
nums (list) – list of topic numbers associated with the hazard
results (pandas DataFrame) – dataframe with topic modeling results generated using topic model plus
results_text_field (string) – column in result dataframe where topic numbers are stored for the specified text column
docs (list) – list of document ids.
doc_topic_distribution (pandas DataFrame) – dataframe containing topic distributions per document
text_field (string) – the column in preprocessed df that stores the text. can be different from results_text_field, but is usually the same
topic_thresh (float) – the probability threshold a document must have to be considered in a topic
preprocessed_df (pandas DataFrame) – pandas dataframe containing documents
id_field (string) – the column in preprocessed df that contains document ids
- Returns:
temp_df (pandas DataFrame) – subset of preprocessed_df only containing documents associated with the specified hazard
ids_ (list) – list of document ids associated with the specified hazard
- mika.kd.trend_analysis.get_hazard_info(hazard_file)
Get hazard information
Loads hazard information from hazard spreadsheet.
- Parameters:
hazard_file (string) – filepath to hazard interpretation spreadsheet
- Returns:
hazard_info (pandas DataFrame) – pandas dataframe with columns for hazard names, hazard words, and topics per hazard
hazards (list) – list of hazard names
- mika.kd.trend_analysis.get_hazard_topics(hazard_df, begin_nums)
Get hazard topics
Gets topic numbers for a hazard from the hazard dataframe.
- Parameters:
hazard_df (pandas DataFrame) – dataframe containing the information (topics, words, etc) for the specified hazard
begin_nums (int) – The topic index to begin at. is -1 if using bertopic since the top level ‘topic’ is really the cluster of documents not belonging to a topic. 0 otherwise.
- Returns:
nums – list of topic numbers associated with the hazard
- Return type:
list
- mika.kd.trend_analysis.get_hazard_topics_per_doc(ids, topics_per_doc, hazard_topics_per_doc, hazard_name, nums, begin_nums)
Get hazard topics per document
Gets the topics per document that are associated with a specified hazard.
- Parameters:
ids (list) – list of document ids associated with the specified hazard
topics_per_doc (dict) – dictionary with keys as document ids and values as a list of topic numbers
hazard_topics_per_doc (dict) – nested dictionary with keys as document ids. inner dictionary has hazard names as keys and values as a list of topics. Inner dictionary values are all empty lists for the input variable.
hazard_name (string) – name of specified hazard
nums (list) – list of topic numbers associated with the hazard
begin_nums (int) – The topic index to begin at. is -1 if using bertopic since the top level ‘topic’ is really the cluster of documents not belonging to a topic. 0 otherwise.
- Returns:
hazard_topics_per_doc – nested dictionary with keys as document ids. inner dictionary has hazard names as keys and values as a list of topics
- Return type:
dict
- mika.kd.trend_analysis.get_hazard_words(hazard_df)
Get hazard words
Gets the hazard words for a specified hazard dataframe.
- Parameters:
hazard_df (pandas DataFrame) – dataframe containing the information (topics, words, etc) for the specified hazard
- Returns:
hazard_words – list of hazard words
- Return type:
list
- mika.kd.trend_analysis.get_likelihood_FAA(rates)
FAA likelihood
Converts hazard rate of occurrence to an FAA likelihood category.
- Parameters:
rates (dict) – dictionary with hazard name as keys and a rate as a value
- Returns:
curr_likelihoods – dictionary with hazard name as keys and a likelihood category as a value
- Return type:
dict
- mika.kd.trend_analysis.get_likelihood_USFS(rates)
USFS likelihood
Converts hazard rate of occurrence to a USFS likelihood category.
- Parameters:
rates (dict) – dictionary with hazard name as keys and a rate as a value
- Returns:
curr_likelihoods – dictionary with hazard name as keys and a likelihood category as a value
- Return type:
dict
- mika.kd.trend_analysis.get_negation_words(hazard_df)
Get negation words
Gets the negation words for a specified hazard dataframe.
- Parameters:
hazard_df (pandas DataFrame) – dataframe containing the information (topics, words, etc) for the specified hazard
- Returns:
negation_words – list of negation words
- Return type:
list
- mika.kd.trend_analysis.get_results_info(results_file, results_text_field, text_field, doc_topic_dist_field)
Get results information
Pulls topic modeling results from results spreadsheet generated with topic model plus.
- Parameters:
results_file (string) – filepath to results spreadsheet
results_text_field (string) – column in result dataframe where topic numbers are stored for the specified text column
text_field (string) – the text field of interest in the preprocessed_df. sometimes it is different from results_text_field but it is usually the same.
doc_topic_dist_field (string or None) – the column storing the topic distribution per document information. Can be omitted; only used when a user wants to filter results so a document only belongs to a topic if the probability is above a specified threshold.
- Returns:
results (pandas DataFrame) – dataframe with topic modeling results generated using topic model plus
results_text_field (string) – column in result dataframe where topic numbers are stored for the specified text column
doc_topic_distribution (pandas DataFrame) – dataframe containing topic distributions per document
begin_nums (int) – The topic index to begin at. is -1 if using bertopic since the top level ‘topic’ is really the cluster of documents not belonging to a topic. 0 otherwise.
- mika.kd.trend_analysis.get_topics_per_doc(docs, results, results_text_field, hazards)
Get topics per document
Finds the topics associated with each document.
- Parameters:
docs (list) – list of document ids.
results (pandas DataFrame) – dataframe with topic modeling results generated using topic model plus
results_text_field (string) – column in result dataframe where topic numbers are stored for the specified text column
hazards (list) – list of hazards
- Returns:
topics_per_doc (dict) – dictionary with keys as document ids and values as a list of topic numbers
hazard_topics_per_doc (dict) – nested dictionary with keys as document ids. inner dictionary has hazard names as keys and values as a list of topics
- mika.kd.trend_analysis.get_word_frequencies(hazard_words_per_doc, hazards_sorted=None)
Word frequencies
Calculates word frequencies.
- Parameters:
hazard_words_per_doc (Dict) – used to store the hazard words per document. keys are hazards and values are lists with an element for each document.
hazards_sorted (list, optional) – ordered list of hazards for generating frequencies. The default is None.
- Returns:
word_frequencies – nested dictionary where keys are hazards. inner dictionary has words as keys and word frequencies as values.
- Return type:
dictionary
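- Example:
A minimal sketch; the hazards and per-document word lists below are hypothetical placeholders, and in practice hazard_words_per_doc comes from identify_docs_per_hazard.
>>> from mika.kd.trend_analysis import get_word_frequencies
>>> hazard_words_per_doc = {'fire': [['flame', 'smoke'], ['smoke']],
...                         'engine failure': [['stall'], []]}
>>> word_frequencies = get_word_frequencies(hazard_words_per_doc)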
- mika.kd.trend_analysis.hazard_accuracy(docs_per_hazard, num, results_path, hazard_words_per_doc, preprocessed_df, text_col, id_col, seed=0)
Hazard accuracy
Creates a data sheet to calculate hazard extraction accuracy by randomly sampling documents for each hazard. This method is actually calculating precision at k, with k=num: the fraction of the k sampled documents per hazard that truly belong to that hazard. Note that this is not the preferred way to evaluate hazard accuracy; instead, a user should use the classification metrics and label a validation set.
- Parameters:
docs_per_hazard (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.
num (int) – number of documents to sample for each hazard.
results_path (string) – filepath to topic model results spreadsheet
hazard_words_per_doc (Dict) – used to store the hazard words per document. keys are hazards and values are lists with an element for each document.
preprocessed_df (pandas DataFrame) – pandas dataframe containing documents
text_col (string) – column in preprocessed_df containing text
id_col (string) – column in preprocessed_df containing document ids
seed (int, optional) – seed for random sampling. The default is 0.
- Returns:
sampled_hazard_ids (dict) – dictionary with hazards as keys and a list of document ids for values
total_ids (list) – list of all the ids of documents belonging to any hazard
- mika.kd.trend_analysis.identify_docs_per_fmea_row(df, grouping_col, year_col, id_col)
Identify documents per FMEA row
Identifies the documents corresponding to each row in an FMEA.
- Parameters:
df (pandas DataFrame) – pandas dataframe containing documents.
grouping_col (string) – the column in df that is used to group documents into FMEA rows
year_col (string) – the column in df that contains document time values, such as report year
id_col (string) – the column in df that contains document ids
- Returns:
frequency (Dict) – Nested dictionary used to store hazard frequencies. Keys are hazards and inner dict keys are years, values are ints.
docs_per_row (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.
- mika.kd.trend_analysis.identify_docs_per_hazard(hazard_file, preprocessed_df, results_file, text_field, time_field, id_field, results_text_field=None, doc_topic_dist_field=None, topic_thresh=0.0)
Identify documents per hazard
Outputs the documents per hazard.
- Parameters:
hazard_file (string) – filepath to hazard interpretation spreadsheet
preprocessed_df (pandas DataFrame) – pandas dataframe containing documents.
results_file (string) – filepath to results spreadsheet
text_field (string) – the text field of interest in the preprocessed_df. Sometimes it differs from results_text_field, but it is usually the same.
time_field (string) – the column in preprocessed df that contains document time values, such as report year
id_field (string) – the column in preprocessed df that contains document ids
results_text_field (string, optional) – column in result dataframe where topic numbers are stored for the specified text column. The default is None.
doc_topic_dist_field (string or None, optional) – the column storing the topic distribution per document. Can be omitted; it is only used when a user wants to filter results so a document only belongs to a topic if the probability is above a specified threshold. The default is None.
topic_thresh (float, optional) – the probability threshold a document must have to be considered in a topic. The default is 0.0.
- Returns:
frequency (Dict) – Nested dictionary used to store hazard frequencies. Keys are hazards and inner dict keys are years, values are ints.
docs_per_hazard (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.
hazard_words_per_doc (Dict) – used to store the hazard words per document. keys are hazards and values are lists with an element for each document.
topics_per_doc (dict) – dictionary with keys as document ids and values as a list of topic numbers
hazard_topics_per_doc (dict) – nested dictionary with keys as document ids. inner dictionary has hazard names as keys and values as a list of topics
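- Example:
A typical call; the file paths and column names below are hypothetical and should point to the hazard interpretation spreadsheet, the preprocessed documents, and the topic model plus results.
>>> import pandas as pd
>>> from mika.kd.trend_analysis import identify_docs_per_hazard
>>> preprocessed_df = pd.read_csv('preprocessed_docs.csv')  # hypothetical document set
>>> frequency, docs_per_hazard, hazard_words_per_doc, topics_per_doc, hazard_topics_per_doc = \
...     identify_docs_per_hazard('hazard_interpretation.xlsx', preprocessed_df,
...                              'topic_model_results.xlsx', text_field='Combined Text',
...                              time_field='Year', id_field='Tracking Number')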
- mika.kd.trend_analysis.make_pie_chart(docs, data, predictor, hazards, id_field, predictor_label=None, save=True, results_path='', pie_kwargs={}, fontsize=16, figsize=(17, 9), padding=5, legend_kwargs={})
Make pie chart
Makes a set of pie charts, with one pie chart per hazard showing the distribution of the categorical predictor variable specified.
- Parameters:
docs (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.
data (pandas DataFrame) – pandas dataframe containing documents.
predictor (string) – column in the data that has the categorical predictor of interest
hazards (list) – list of hazards
id_field (string) – the column in preprocessed df that contains document ids
predictor_label (string, optional) – predictor label to be shown in the figure title, by default None
save (bool, optional) – True to save the figure, by default True
results_path (string, optional) – path to save figure to. The default is “”.
pie_kwargs (Dict, optional) – Dictionary to pass kwargs into the piechart, default an empty dictionary
fontsize (int, optional) – fontsize for the plot. The default is 16.
figsize (tuple, optional) – size of the figure in inches. The default is (17, 9).
padding (int, optional) – the padding between graphs. The default is 5.
legend_kwargs (dict, optional) – dictionary to pass in legend options
- mika.kd.trend_analysis.minmax_scale(data_list)
Minmax scale
Performs minmax scaling on a single data list in order to normalize the data. Normalization is required prior to regression and ML; it is also used for graphing multiple time series on the same axes.
- Parameters:
data_list (list) – list of numerical data that will be scaled using minmax scaling
- Returns:
scaled_list – list of data scaled between 0-1
- Return type:
list
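- Example:
A minimal sketch of the standard minmax definition, x' = (x - min) / (max - min); the expected output in the comment assumes that definition.
>>> from mika.kd.trend_analysis import minmax_scale
>>> minmax_scale([2, 4, 6, 10])  # expected: [0.0, 0.25, 0.5, 1.0]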
- mika.kd.trend_analysis.multiple_reg_feature_importance(predictors, hazards, correlation_mat_total, save=False, results_path='', r2_fontsize=10, r2_figsize=(3.5, 4), predictor_import_fontsize=10, predictor_import_figsize=(7, 4))
Multiple regression feature importance
Builds a multiple regression model for hazard frequency given the predictors. Also performs predictor importance to identify which predictors are most relevant to hazard frequency. Predictors and hazards must be values from the correlation_mat_total.
- Parameters:
predictors (list) – list of predictor names, used to identify inputs to multiple regression
hazards (list) – list of hazard names, used to identify targets for multiple regression
correlation_mat_total (dataframe) – stores the time series values that were used for the correlation matrix. Rows are years; columns are predictors + hazard frequencies.
save (bool, optional) – True to save resulting figure, by default False
results_path (str, optional) – path to save the figure to, by default “”
r2_fontsize (int, optional) – fontsize for the r2 figure, by default 10
r2_figsize (tuple, optional) – figsize for r2 figure, by default (3.5,4)
predictor_import_fontsize (int, optional) – fontsize for the predictor importance figure, by default 10
predictor_import_figsize (tuple, optional) – figsize for predictor importance figure, by default (7,4)
- Returns:
results_df (Pandas Dataframe) – dataframe containing both the mse and r2 scores for the model
delta_df (Pandas Dataframe) – dataframe containing the difference between the total model and the model when the predictor is shuffled
coefficient_df (Pandas Dataframe) – dataframe containing the regression coefficients for predictor importance
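- Example:
The predictor importance reported in delta_df follows the permutation idea: score the fitted model with one predictor's values shuffled and record how much the score drops. A minimal sketch of that idea with scikit-learn, as an illustration of the concept rather than this function's exact implementation:
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.metrics import r2_score
>>> rng = np.random.default_rng(0)
>>> X = rng.normal(size=(50, 3))  # hypothetical predictor time series (columns = predictors)
>>> y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=50)  # hazard frequency
>>> model = LinearRegression().fit(X, y)
>>> base_r2 = r2_score(y, model.predict(X))
>>> X_perm = X.copy()
>>> X_perm[:, 0] = rng.permutation(X_perm[:, 0])  # shuffle the first predictor
>>> delta = base_r2 - r2_score(y, model.predict(X_perm))  # drop in r2 = importance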
- mika.kd.trend_analysis.plot_USFS_risk_matrix(likelihoods, severities, figsize=(9, 5), save=False, results_path='', fontsize=12, max_chars=20, title=False)
Plot USFS risk matrix
Plots a USFS risk matrix from likelihood and severity categories.
- Parameters:
likelihoods (dict) – dictionary with hazard name as keys and a likelihood category as a value
severities (dict) – dictionary with hazard name as keys and a severity category as a value
figsize (tuple, optional) – figure size in inches. The default is (9,5).
save (boolean, optional) – true to save the figure. The default is False.
results_path (string, optional) – path to save figure to. The default is “”.
fontsize (int, optional) – figure fontsize. The default is 12.
max_chars (int, optional) – maximum characters per line in a cell of the risk matrix. used to improve readability and ensure hazard names are contained in a cell. The default is 20.
title (boolean, optional) – True to show title. The default is False.
- Return type:
None.
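- Example:
A minimal usage sketch; the category labels below are hypothetical placeholders, and in practice likelihoods comes from get_likelihood_USFS and severities from a user-defined severity function.
>>> from mika.kd.trend_analysis import plot_USFS_risk_matrix
>>> likelihoods = {'fire': 'Probable', 'engine failure': 'Unlikely'}
>>> severities = {'fire': 'Major', 'engine failure': 'Catastrophic'}
>>> plot_USFS_risk_matrix(likelihoods, severities, save=True, results_path='results/')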
- mika.kd.trend_analysis.plot_frequency_time_series(metric_data, metric_name='Frequency', line_styles=[], markers=[], title='', time_name='Year', xtick_freq=5, scale=True, save=False, results_path='', yscale=None, legend=True, figsize=(6, 4), fontsize=16, interval=False, interval_kwargs={'false_neg_rate': 0.05, 'false_pos_rate': 0.05}, legend_kwargs={})
Plot frequency time series
Plots hazard frequency over time. Differs from plot_metric_time_series in the structure of its input data.
- Parameters:
metric_data (dict) – nested dict where keys are hazards. inner dict has time value (usually years) as keys and frequency count as an integer for values.
metric_name (string, optional) – name of metric. The default is ‘Frequency’.
line_styles (list, optional) – list of line styles to use. should have one value for each hazard. The default is [].
markers (list, optional) – list of line markers to use. should have one value for each hazard. The default is [].
title (string, optional) – title to add to plot. The default is “”.
time_name (string, optional) – name of the time interval used to label the x axis. The default is “Year”.
xtick_freq (int, optional) – the number of values per x tick, e.g., the default would go 2015, 2020, 2025. The default is 5.
scale (boolean, optional) – true to minmax scale data, false to use raw data. The default is True.
save (boolean, optional) – true to save the plot as a pdf. The default is False.
results_path (string, optional) – path to save figure to. The default is “”.
yscale (string, optional) – yscale parameter, can be used to change scaling to log. The default is None.
legend (boolean, optional) – true to show legend, false to hide legend. The default is True.
figsize (tuple, optional) – size of the plot in inches. The default is (6,4).
fontsize (int, optional) – fontsize for the plot. The default is 16.
interval (bool, optional) – true to construct an interval for the metric based on the false pos/neg rates. The default is False.
interval_kwargs (dict, optional) – dictionary to pass in the false positive and false negative rate for each hazard. The default is {‘false_pos_rate’:0.05, ‘false_neg_rate’:0.05}.
legend_kwargs (dict, optional) – dictionary to pass in legend options
- Return type:
None.
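- Example:
A minimal sketch; the frequency dictionary below is hypothetical and is typically the frequency output of identify_docs_per_hazard.
>>> from mika.kd.trend_analysis import plot_frequency_time_series
>>> frequency = {'fire': {2018: 12, 2019: 15, 2020: 9},
...              'engine failure': {2018: 4, 2019: 7, 2020: 6}}
>>> plot_frequency_time_series(frequency, metric_name='Frequency', time_name='Year',
...                            scale=False)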
- mika.kd.trend_analysis.plot_metric_averages(metric_data, metric_name, show_std=True, title='', save=False, results_path='', yscale=None, legend=True, figsize=(6, 4), fontsize=16, error_bars='stddev')
Plot metric averages
Plots metric averages as a barchart.
- Parameters:
metric_data (dict) – nested dict where keys are hazards. inner dict has time value (usually years) as keys and list of metrics for values.
metric_name (string) – name of metric, e.g., severity, used for axis and saving the figure
show_std (boolean, optional) – true to show std deviation on time series as error bars. The default is True.
title (string, optional) – title to add to plot. The default is “”.
save (boolean, optional) – true to save the plot as a pdf. The default is False.
results_path (string, optional) – path to save figure to. The default is “”.
yscale (string, optional) – yscale parameter, can be used to change scaling to log. The default is None.
legend (boolean, optional) – true to show legend, false to hide legend. The default is True.
figsize (tuple, optional) – size of the plot in inches. The default is (6,4).
fontsize (int, optional) – fontsize for the plot. The default is 16.
error_bars (string) – type of error bars to use, can be ‘stddev’ or ‘CI’. The default is ‘stddev’.
- Return type:
None.
- mika.kd.trend_analysis.plot_metric_time_series(metric_data, metric_name, line_styles=[], markers=[], title='', time_name='Year', scaled=False, xtick_freq=5, show_std=True, save=False, results_path='', yscale=None, legend=True, figsize=(6, 4), fontsize=16, bootstrap=False, bootstrap_kwargs={'CI_interval': 95, 'metric_percentages': 1, 'num_means': 1000}, legend_kwargs={})
Plot metric time series
Plots a time series for specified metrics for all hazards (i.e., line chart).
- Parameters:
metric_data (dict) – nested dict where keys are hazards. inner dict has time value (usually years) as keys and list of metrics for values.
metric_name (string) – name of metric, e.g., severity, used for axis and saving the figure
line_styles (list, optional) – list of line styles to use. should have one value for each hazard. The default is [].
markers (list, optional) – list of line markers to use. should have one value for each hazard. The default is [].
title (string, optional) – title to add to plot. The default is “”.
time_name (string, optional) – name of the time interval used to label the x axis. The default is “Year”.
scaled (boolean, optional) – true to minmax scale data, false to use raw data. The default is False.
xtick_freq (int, optional) – the number of values per x tick, e.g., the default would go 2015, 2020, 2025. The default is 5.
show_std (boolean, optional) – true to show std deviation on time series as error bars. The default is True.
save (boolean, optional) – true to save the plot as a pdf. The default is False.
results_path (string, optional) – path to save figure to. The default is “”.
yscale (string, optional) – yscale parameter, can be used to change scaling to log. The default is None.
legend (boolean, optional) – true to show legend, false to hide legend. The default is True.
figsize (tuple, optional) – size of the plot in inches. The default is (6,4).
fontsize (int, optional) – fontsize for the plot. The default is 16.
bootstrap (Boolean, optional) – true to bootstrap the data for a confidence interval. The default is False.
bootstrap_kwargs (dict, optional) – dictionary to pass bootstrapping parameters
legend_kwargs (dict, optional) – dictionary to pass in legend options
- Return type:
None.
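- Example:
A minimal sketch; the metric values below are hypothetical lists of per-document severities for each year.
>>> from mika.kd.trend_analysis import plot_metric_time_series
>>> metric_data = {'fire': {2018: [3, 4, 5], 2019: [2, 2], 2020: [4]}}
>>> plot_metric_time_series(metric_data, metric_name='Severity', show_std=True)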
- mika.kd.trend_analysis.plot_predictors(predictors, predictor_labels, time, time_label='Year', title='', totals=True, averages=True, scaled=True, figsize=(12, 5), axs=[], fig=None, show=False, legend=True, legend_kwargs={})
Plot predictors
Plots predictor timeseries.
- Parameters:
predictors (list) – list of predictors
predictor_labels (list) – list of labels for the predictors which will be shown on the graph
time (list) – list of time values that define the axis/time series.
time_label (string, optional) – label for the time values. The default is ‘Year’.
title (string, optional) – figure title. The default is “”.
totals (boolean, optional) – true to graph total or sum values. The default is True.
averages (boolean, optional) – true to graph average values. The default is True.
scaled (boolean, optional) – true to minmax scale the timeseries data. The default is True.
figsize (tuple, optional) – figure size in inches. The default is (12, 5).
axs (matplotlib axs object, optional) – used to plot multiple graphs on one figure. The default is [].
fig (matplotlib fig object, optional) – used to plot multiple graphs on one figure. The default is None.
show (boolean, optional) – true to show figure, false to return graph objects. The default is False.
legend (boolean, optional) – true to show legend. The default is True.
legend_kwargs (dict, optional) – dictionary to pass in legend options
- Returns:
fig (matplotlib fig object) – the figure object
axs (matplotlib axs object) – axs object
- mika.kd.trend_analysis.plot_risk_matrix(likelihoods, severities, figsize=(9, 5), save=False, results_path='', fontsize=12, max_chars=20, annot_font=12)
Plot risk matrix
Plots an FAA risk matrix from likelihood and severity categories.
- Parameters:
likelihoods (dict) – dictionary with hazard name as keys and a likelihood category as a value
severities (dict) – dictionary with hazard name as keys and a severity category as a value
figsize (tuple, optional) – figure size in inches. The default is (9,5).
save (boolean, optional) – true to save the figure. The default is False.
results_path (string, optional) – path to save figure to. The default is “”.
fontsize (int, optional) – figure fontsize. The default is 12.
max_chars (int, optional) – maximum characters per line in a cell of the risk matrix. used to improve readability and ensure hazard names are contained in a cell. The default is 20.
annot_font (int, optional) – figure annotation fontsize. The default is 12.
- Return type:
None.
- mika.kd.trend_analysis.proposed_topics(lists=[])
Proposed topics
Experimental function to identify topics that may be relevant to specified hazards based on manually labeled data.
- Parameters:
lists (list, optional) – list of lists; inner lists are topic numbers for each document manually labeled as associated with a hazard. The default is [].
- Returns:
proposed_topics – list of proposed new topics.
- Return type:
list
- mika.kd.trend_analysis.record_hazard_doc_info(hazard_name, year, docs_per_hazard, id_, frequency, hazard_words_per_doc, docs, h_word)
Record hazard document information
Saves the information for a specified document that contains a specified hazard.
- Parameters:
hazard_name (string) – name of specified hazard
year (int or str) – year that the report occurs in
docs_per_hazard (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.
id_ (string) – id of the specified document
frequency (Dict) – Nested dictionary used to store hazard frequencies. Keys are hazards and inner dict keys are years, values are ints.
hazard_words_per_doc (Dict) – used to store the hazard words per document. keys are hazards and values are lists with an element for each document.
docs (list) – list of document ids.
h_word (string) – hazard word
- Returns:
docs_per_hazard (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.
frequency (Dict) – Nested dictionary used to store hazard frequencies. Keys are hazards and inner dict keys are years, values are ints.
hazard_words_per_doc (Dict) – used to store the hazard words per document. keys are hazards and values are lists with an element for each document.
- mika.kd.trend_analysis.regression_feature_importance(predictors, hazards, correlation_mat_total)
Regression feature importance
Performs feature importance for a set of linear regression analyses.
- Parameters:
predictors (list) – list of predictor names, used to identify inputs to single linear regression
hazards (list) – list of hazard names, used to identify targets for single linear regression
correlation_mat_total (dataframe) – stores the time series values that were used for the correlation matrix. Rows are years; columns are predictors + hazard frequencies.
- Returns:
results_df – dataframe containing both the mse and r2 scores for the model
- Return type:
pandas dataframe
- mika.kd.trend_analysis.remove_outliers(data, threshold=1.5, rm_outliers=True)
Remove outliers
Removes outliers from the dataset using the interquartile range (IQR).
- Parameters:
data (list) – list of data points
threshold (float, optional) – Threshold for the distance outside of the interquartile range that defines an outlier. The default is 1.5.
rm_outliers (Boolean, optional) – True to remove outliers, False to return the original data. The default is True.
- Returns:
new_data – list of the data points with outliers removed
- Return type:
list
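- Example:
The IQR rule keeps points inside [Q1 - threshold*IQR, Q3 + threshold*IQR]. A sketch of that rule, as an illustration of the concept rather than necessarily this function's exact implementation:
>>> import numpy as np
>>> data = [1, 2, 2, 3, 3, 4, 100]  # 100 is an obvious outlier
>>> q1, q3 = np.percentile(data, [25, 75])
>>> iqr = q3 - q1
>>> kept = [x for x in data if q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr]  # drops 100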
- mika.kd.trend_analysis.reshape_correlation_matrix(corrMatrix, p_values, predictors, hazards, figsize=(8, 8.025), fontsize=16)
Reshape correlation matrix
Reshapes the correlation matrix between all predictors and all hazard frequencies. Columns are predictors and rows are hazards. Arguments are outputs from create_correlation_matrix.
- Parameters:
corrMatrix (pandas DataFrame) – correlation matrix.
p_values (pandas DataFrame) – p-values for each correlation.
predictors (list) – list of predictors
hazards (list) – list of hazards
figsize (tuple, optional) – size of the plot in inches. The default is (8, 8.025).
fontsize (int, optional) – fontsize for the plot. The default is 16.
- Return type:
None.
- mika.kd.trend_analysis.sample_for_accuracy(preprocessed_df, id_col, text_col, hazards, save_path, num_sample=100)
Sample for accuracy
Generates a spreadsheet of randomly sampled documents to analyze the quality of hazard extraction.
- Parameters:
preprocessed_df (pandas DataFrame) – pandas dataframe containing documents
id_col (string) – column in preprocessed_df containing document ids
text_col (string) – column in preprocessed_df containing text
hazards (list) – list of hazards
save_path (string) – location to save the file to
num_sample (int, optional) – number of documents to sample. The default is 100.
- Returns:
sampled_df – dataframe of sampled documents
- Return type:
pandas DataFrame
- mika.kd.trend_analysis.set_up_docs_per_hazard_vars(preprocessed_df, id_field, hazards, time_field)
Set up documents per hazard variables
Instantiates variables used to find the documents per hazard.
- Parameters:
preprocessed_df (pandas DataFrame) – pandas dataframe containing documents
id_field (string) – the column in preprocessed df that contains document ids
hazards (list) – list of hazards
time_field (string) – the column in preprocessed df that contains document time values, such as report year
- Returns:
docs (list) – list of document ids.
frequency (Dict) – Nested dictionary used to store hazard frequencies. Keys are hazards and inner dict keys are years, values are ints.
docs_per_hazard (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.
hazard_words_per_doc (Dict) – used to store the hazard words per document. keys are hazards and values are lists with an element for each document.