kd package

mika.kd.FMEA

class mika.kd.FMEA

A class used for generating FMEAs from a dataset of reports.

build_fmea(severity_func, group_by, year_col, group_by_kwargs={}, post_process_kwargs={}, save=True)

Builds the FMEA using the class's other methods, all in one call. Less customizable, but useful for a quick implementation.

Parameters:
  • severity_func (function) – User-defined function for calculating severity; passed to calc_severity.

  • group_by (string) – method used to group together the FMEA rows, either from a manual file or by metadata.

  • year_col (string) – The column the year for the report is stored in.

  • group_by_kwargs (dict, optional) – dictionary containing all inputs for the group_by function. The default is {}.

  • post_process_kwargs (dict, optional) – dictionary containing all inputs for the post_process_fmea function. The default is {}.

Return type:

None.
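
A minimal end-to-end sketch (the data file, column names, severity function signature, and group_by value are illustrative assumptions; see load_data, load_model, and the group_docs_* methods for the exact inputs expected):

    from mika.kd import FMEA

    fmea = FMEA()  # assuming a no-argument constructor

    # Hypothetical severity function; assumes severity is computed per report row.
    def severity(report):
        return 2 * report["Injuries"] + report["Damage"]  # assumed columns

    fmea.load_data(text_col="Narrative", id_col="Tracking #", filepath="reports.csv")
    fmea.load_model()            # defaults to the MIKA FMEA NER checkpoint on CPU
    fmea.predict()               # run NER on the preprocessed sentences
    fmea.get_entities_per_doc()  # reconstruct documents from sentences
    fmea.build_fmea(severity_func=severity,
                    group_by="metadata",  # assumed flag value; grouping by a metadata column
                    year_col="Year",
                    group_by_kwargs={"grouping_col": "Mission Type"})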

calc_frequency(year_col)

Calculates the frequency for each row and assigns it a category.

Parameters:

year_col (string) – The column the year for the report is stored in.

Returns:

grouped_df – Grouped df with frequency column added

Return type:

DataFrame

calc_risk()

Calculates risk as the product of severity and frequency. Adds risk column to the grouped df.

Return type:

None.

calc_severity(severity_func, from_file=False, file_name='', file_kwargs={})

Calculates the severity for each row according to a defined severity function.

Parameters:
  • severity_func (function) – User defined function for calculating severity. Usually a linear combination of other values.

  • from_file (Boolean, optional) – True if the severity value is already stored in a file, False if calculated from a severity function. The default is False.

  • file_name (string, optional) – filepath to a spreadsheet containing the severity value for each document. The default is ‘’.

  • file_kwargs (dict, optional) – any kwargs needed to read the file. Typically needed for .xlsx workbooks with multiple sheets. The default is {}.

Return type:

None.
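
Continuing the sketch above, the file-based path looks like this (the file name and sheet name are hypothetical; file_kwargs is presumably forwarded to the spreadsheet reader, e.g. a sheet_name for multi-sheet workbooks):

    fmea.calc_severity(severity_func=None,  # assumed unused when from_file=True
                       from_file=True,
                       file_name="severity_scores.xlsx",
                       file_kwargs={"sheet_name": "Severity"})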

display_doc(doc_id, save=True, output_path='', colors_path=None, pred=True)

Displays an annotated document with entities highlighted accordingly.

Parameters:
  • doc_id (string) – The id of the document to be displayed.

  • save (Boolean, optional) – Saves as html if true. Displays if False. The default is True.

  • output_path (string, optional) – The filepath the display will be saved to. The default is “”.

  • colors_path (string, optional) – The path to a file that defines the colors to be used for each entity. The default is None.

  • pred (Boolean, optional) – True if the displayed document is from predictions, False if from manual annotations. The default is True.

Returns:

html – The rendered HTML of the annotated document.

Return type:

string

evaluate_preds(cm=True, class_report=True)

Can only be used if the input data is labeled. Evaluates the performance of the NER model against labeled data.

Parameters:
  • cm (Boolean, optional) – Creates a confusion matrix if True. The default is True.

  • class_report (Boolean, optional) – Creates a classification report if True. The default is True.

Returns:

return_vals – Dict containing confusion matrix and classification report if specified.

Return type:

Dictionary

get_entities_per_doc(pred=True)

Gets all entities for each document. Note that this is required because the NER model is run on sentences. This function reconstructs the documents from the sentences, while preserving the entities.

Parameters:

pred (Boolean, optional) – True if the entities per doc are from predicted entities. False if the entities per doc are from labels. The default is True.

Returns:

data_df – pandas DataFrame with documents as rows and entities as columns

Return type:

pandas DataFrame

get_year_per_doc(year_col, config='/')

Used to convert dates to years prior to calculating frequency.

Parameters:
  • year_col (string) – The column in the raw dataframe with the date information.

  • config (string, optional) – The delimiter separating date components in the date column. The default is ‘/’.

Return type:

None.

group_docs_manual(filename, grouping_col, additional_cols=[], sample=1)

Creates FMEA rows by grouping together documents according to values manually defined in a separate file. Loads in the file and then aggregates the data. Sample IDs for documents in each row are created as well.

Parameters:
  • filename (string) – filepath to the spreadsheet defining the rows.

  • grouping_col (string) – The column within the spreadsheet that defines the rows.

  • additional_cols (list, optional) – Additional columns to include in the FMEA. The default is [].

  • sample (int, optional) – Number of samples to pull for each FMEA row. The default is 1.

Returns:

grouped_df – The grouped FMEA dataframe

Return type:

DataFrame

group_docs_with_meta(grouping_col, additional_cols=[], sample=1)

Groups documents into an FMEA using a grouping column, which is metadata from the initial dataset.

Parameters:
  • grouping_col (string) – The column in the original dataset used to group documents into FMEA rows.

  • additional_cols (list of strings, optional) – additional columns in a dataset to include in the FMEA. The default is [].

  • sample (int, optional) – Number of samples to pull for each FMEA row. The default is 1.

Returns:

grouped_df – The grouped FMEA dataframe

Return type:

DataFrame

load_data(text_col, id_col, filepath='', df=None, formatted=False, label_col='labels')

Loads data to prepare for FMEA extraction. Sentence tokenization is performed as preprocessing, and the raw data is also saved. Accepts a filepath to a .jsonl (annotations from doccano) or .csv file, a dataframe already loaded in, or the location of a Hugging Face dataset object. Saves the data formatted for input into the NER model. Requires the spacy en_core_web_trf model to be downloaded.

Parameters:
  • text_col (string) – The column where the text used for FMEA extraction is stored.

  • id_col (string) – The id column in the dataframe.

  • filepath (string, optional) – Can input a filepath for a .jsonl (annotations from doccano) or .csv file. The default is ‘’.

  • df (pandas DataFrame, optional) – Can instead input a pandas DataFrame already loaded in, with one column of text. The default is None.

  • formatted (Bool, optional) – True if the input in filepath is a formatted dataset object. The default is False.

  • label_col (string, optional) – The column containing annotation labels if the data is annotated. The default is “labels”.

Return type:

None.
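
The three input modes look like this sketch (file and column names are hypothetical):

    import pandas as pd

    # 1. From a .csv or .jsonl file:
    fmea.load_data(text_col="Narrative", id_col="Tracking #", filepath="reports.csv")

    # 2. From a DataFrame already in memory:
    df = pd.read_csv("reports.csv")
    fmea.load_data(text_col="Narrative", id_col="Tracking #", df=df)

    # 3. From a previously formatted dataset object saved to disk:
    fmea.load_data(text_col="Narrative", id_col="Tracking #",
                   filepath="formatted_dataset", formatted=True)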

load_model(model_checkpoint='NASA-AIML/MIKA_BERT_FMEA_NER', device=-1)

Loads in a fine-tuned custom NER model trained to extract FMEA entities. If no checkpoint is passed, the custom model from MIKA is used.

Parameters:
  • model_checkpoint (string, optional) – Model checkpoint; can be a Hugging Face model id or a path on the local device. The default is ‘NASA-AIML/MIKA_BERT_FMEA_NER’.

  • device (int, optional) – Device to run the model on. The default is -1 to run on CPU.

Return type:

None.
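
For example (the local checkpoint path is hypothetical; the device argument is assumed to follow the Hugging Face convention of -1 for CPU and 0+ for a GPU index):

    fmea.load_model()  # MIKA's fine-tuned FMEA NER model, on CPU
    fmea.load_model(model_checkpoint="models/my_ner_checkpoint", device=0)  # local model on GPU 0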

post_process_fmea(id_name='ID', phase_name='Mission Type', max_words=20)

Post processes the FMEA to identify the column that contains the phase name, clean sub-word tokens, and limit the number of words per cell.

Parameters:
  • id_name (string, optional) – Name of the dataset used/name of the id column. The default is ‘ID’.

  • phase_name (string, optional) – Column that can be used to find the phase of operation. The default is ‘Mission Type’.

  • max_words (int, optional) – Maximum number of words in a cell in the FMEA. The default is 20.

Returns:

fmea_df – FMEA post processed DataFrame

Return type:

DataFrame

predict()

Performs named entity recognition on the input data.

Returns:

Predicted entities for each document

Return type:

Preds

mika.kd.Topic_Model_plus

class mika.kd.Topic_Model_plus(text_columns=[], data=None, ngrams=None, results_path='')

Topic model plus

A class for topic modeling for aviation safety.

Variables:
  • text_columns (list) – defines various columns within a single database which will be used for topic modeling

  • data (Data) – Data object storing the text corpus

  • ngrams (str) – ‘tp’ if the user wants tomotopy to form ngrams prior to applying a topic model

  • doc_ids (list) – list of document ids pulled from data object

  • data_df (pandas dataframe) – df storing documents pulled from data object

  • data_name (string) – dataset name pulled from data object

  • id_col (string) – the column storing the document ids pulled from data object

  • hlda_models (dictionary) – variable for storing hlda models

  • lda_models (dictionary) – variable for storing lda models

  • folder_path (string) – destination for storing results and models

  • results_path (string) – destination in which the results folder is created
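
A minimal construction sketch, assuming the Data class lives at mika.utils and exposes a loader along these lines:

    from mika.utils import Data           # assumed import path for the Data class
    from mika.kd import Topic_Model_plus

    data = Data()
    data.load("reports.csv", id_col="Tracking #", text_columns=["Narrative"])  # assumed loader API
    tm = Topic_Model_plus(text_columns=["Narrative"], data=data, results_path="results/")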

bert_topic(sentence_transformer_model=None, umap=None, hdbscan=None, count_vectorizor=None, ngram_range=(1, 3), BERTkwargs={}, from_probs=True, thresh=0.01)

Train a bertopic model.

Parameters:
  • sentence_transformer_model (BERT model object, optional) – BERT model object used for embeddings. The default is None.

  • umap (umap model object, optional) – umap model object used for dimensionality reduction. The default is None.

  • hdbscan (hdbscan model object, optional) – hdbscan model object for clustering. The default is None.

  • count_vectorizor (CountVectorizer object, optional) – count vectorizer object used for c-TF-IDF. The default is None.

  • ngram_range (tuple, optional) – range of ngrams to be considered. The default is (1,3).

  • BERTkwargs (dict, optional) – dictionary of kwargs passed into bertopic. The default is {}.

  • from_probs (boolean, optional) – true to assign topics to documents based on a probability threshold (i.e., documents can have multiple topics). The default is True.

  • thresh (float, optional) – probability threshold used when from_probs=True. The default is 0.01.

Return type:

None.
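
A sketch passing custom components through to BERTopic (the embedding model and hyperparameters are illustrative):

    from sentence_transformers import SentenceTransformer
    from umap import UMAP
    from hdbscan import HDBSCAN

    tm.bert_topic(
        sentence_transformer_model=SentenceTransformer("all-MiniLM-L6-v2"),
        umap=UMAP(n_neighbors=15, n_components=5, min_dist=0.0),
        hdbscan=HDBSCAN(min_cluster_size=10),
        ngram_range=(1, 2),
        from_probs=True,
        thresh=0.01)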

calc_bert_coherence(docs, topics, topic_model, method='u_mass', num_words=10)

Calculates coherence for a bertopic model using gensim coherence models.

Parameters:
  • docs (list) – List of document text.

  • topics (List) – List of topics per document.

  • topic_model (BERTopic model object) – Object containing the trained topic model.

  • method (string, optional) – Method used to calculate coherence. Can be any method used in gensim. The default is ‘u_mass’.

  • num_words (int, optional) – Number of words in the topic used to calculate coherence. The default is 10.

Returns:

coherence_per_topic – List of coherence scores for each topic.

Return type:

List

coherence_scores(mdl, lda_or_hlda, measure='c_v')

Computes and returns coherence scores for lda and hlda models.

Parameters:
  • mdl (lda or hlda model object) – topic model object created previously

  • lda_or_hlda (str) – denotes whether coherence is being calculated for lda or hlda

  • measure (string, optional) – denotes which coherence metric to compute. The default is ‘c_v’.

Returns:

scores – coherence scores, averages, and std dev

Return type:

dict

get_bert_coherence(coh_method='u_mass', from_probs=False)

Gets coherence for bert models and saves it in a dictionary.

Parameters:
  • coh_method (string, optional) – Method used to calculate coherence. Can be any method used in gensim. The default is ‘u_mass’.

  • from_probs (boolean, optional) – Whether or not to use document topic probabilities to assign topics. True to use probabilities - i.e., each document can have multiple topics. False to not use probabilities - i.e., each document only has one topic. The default is False.

Return type:

None.

get_bert_topic_diversity(topk=10)

Gets topic diversity scores for a BERTopic model.

Parameters:

topk (int, optional) – Number of words per topic used to calculate diversity. The default is 10.

Return type:

None.

get_bert_topics_from_probs(topic_df, thresh=0.01, coherence=False)

Saves topic model results including each topic number, words, number of words, and best document when document topics are defined by a probability threshold.

Parameters:
  • topic_df (dictionary of dataframes) – dictionary of dataframes where each key is a text column and each value is the corresponding topic model results

  • thresh (float, optional) – probability threshold used when from_probs=True. The default is 0.01.

  • coherence (boolean, optional) – true to calculate coherence for the model and save the results. The default is False.

Returns:

new_topic_dfs – dictionary of dataframes where each key is a text column and each value is the corresponding topic model results

Return type:

dictionary of dataframes

hdp(training_iterations=1000, iteration_step=10, to_lda=True, kwargs={}, topic_threshold=0.0)

Performs HDP topic modeling which is useful when the number of topics is not known.

Parameters:
  • training_iterations (int, optional) – number of training iterations. The default is 1000.

  • iteration_step (int, optional) – number of steps per iteration. The default is 10.

  • to_lda (boolean, optional) – True to convert the hdp model to an lda model. The default is True.

  • kwargs (dict, optional) – kwargs to pass into the hdp model. The default is {}.

  • topic_threshold (float, optional) – probability threshold used when converting hdp topics to lda topics. The default is 0.0.

Return type:

None.

hlda(levels=3, training_iterations=1000, iteration_step=10, **kwargs)

Performs hlda topic modeling.

Parameters:
  • levels (int, optional) – number of hierarchical levels. The default is 3.

  • training_iterations (int, optional) – number of training iterations. The default is 1000.

  • iteration_step (int, optional) – number of steps per iteration. The default is 10.

  • **kwargs (dict) – any kwargs for the hlda topic model

Return type:

None.

hlda_display(col, num_words=5, display_options={'level 1': 1, 'level 2': 6}, colors='bupu', filename='')

Saves graphviz visualization of hlda tree structure.

Parameters:
  • col (string) – column of interest

  • num_words (int, optional) – number of words per node. The default is 5.

  • display_options (dictionary, optional) – nested dictionary where keys are levels and values are the max number of nodes. The default is {“level 1”: 1, “level 2”: 6}.

  • colors (string, optional) – Brewer color scheme used; see http://graphviz.org/doc/info/colors.html#brewer for options. The default is ‘bupu’ (blue-purple).

  • filename (string, optional) – can input a filename for where the topics are stored in order to create the display after hlda; must be an output from save_hlda_topics() or an hlda.bin object. The default is ‘’.

Return type:

None.

hlda_extract_models(file_path)

Gets hlda models from file.

Parameters:

file_path (string) – path to file

Return type:

None.

hlda_visual(col)

Saves pyLDAvis output from hlda to file.

Parameters:

col (str) – reference to column of interest

Return type:

None.

label_hlda_topics(extractor_min_cf=5, extractor_min_df=3, extractor_max_len=5, extractor_max_cand=5000, labeler_min_df=5, labeler_smoothing=0.01, labeler_mu=0.25, label_top_n=3)

Uses tomotopy’s auto topic labeling tool to label topics. Stores labels in the class; after running this function, a flag in the taxonomy-saving functions controls whether the labels are used.

Parameters:
  • extractor_min_cf (int) – from tomotopy docs: “minimum collection frequency of collocations. Collocations with a smaller collection frequency than min_cf are excluded from the candidates. Set this value large if the corpus is big”

  • extractor_min_df (int) – from tomotopy docs: “minimum document frequency of collocations. Collocations with a smaller document frequency than min_df are excluded from the candidates. Set this value large if the corpus is big”

  • extractor_max_len (int) – from tomotopy docs: “maximum length of collocations”

  • extractor_max_cand (int) – from tomotopy docs: “maximum number of candidates to extract”

  • labeler_min_df (int) – from tomotopy docs: “minimum document frequency of collocations. Collocations with a smaller document frequency than min_df are excluded from the candidates. Set this value large if the corpus is big”

  • labeler_smoothing (float) – from tomotopy docs: “a small value greater than 0 for Laplace smoothing”

  • labeler_mu (float) – from tomotopy docs: “a discriminative coefficient. Candidates with high score on a specific topic and with low score on other topics get the higher final score when this value is the larger.”

  • label_top_n (int) – from tomotopy docs: “the number of labels”

label_lda_topics(extractor_min_cf=5, extractor_min_df=3, extractor_max_len=5, extractor_max_cand=5000, labeler_min_df=5, labeler_smoothing=0.01, labeler_mu=0.25, label_top_n=3)

Uses tomotopy’s auto topic labeling tool to label topics. Stores labels in the class; after running this function, a flag in the taxonomy-saving functions controls whether the labels are used.

Parameters:
  • extractor_min_cf (int) – from tomotopy docs: “minimum collection frequency of collocations. Collocations with a smaller collection frequency than min_cf are excluded from the candidates. Set this value large if the corpus is big”

  • extractor_min_df (int) – from tomotopy docs: “minimum document frequency of collocations. Collocations with a smaller document frequency than min_df are excluded from the candidates. Set this value large if the corpus is big”

  • extractor_max_len (int) – from tomotopy docs: “maximum length of collocations”

  • extractor_max_cand (int) – from tomotopy docs: “maximum number of candidates to extract”

  • labeler_min_df (int) – from tomotopy docs: “minimum document frequency of collocations. Collocations with a smaller document frequency than min_df are excluded from the candidates. Set this value large if the corpus is big”

  • labeler_smoothing (float) – from tomotopy docs: “a small value greater than 0 for Laplace smoothing”

  • labeler_mu (float) – from tomotopy docs: “a discriminative coefficient. Candidates with high score on a specific topic and with low score on other topics get the higher final score when this value is the larger.”

  • label_top_n (int) – from tomotopy docs: “the number of labels”

lda(num_topics={}, training_iterations=1000, iteration_step=10, max_topics=0, **kwargs)

Performs LDA topic modeling.

Parameters:
  • num_topics (dict, optional) – keys are columns in text_columns and values are the number of topics lda forms. If omitted, lda optimization is run to determine the number of topics. The default is {}.

  • training_iterations (int, optional) – number of training iterations. The default is 1000.

  • iteration_step (int, optional) – number of steps per iteration. The default is 10.

  • max_topics (int, optional) – maximum number of topics to consider during lda optimization. The default is 0.

  • **kwargs (dict) – any kwargs for the lda topic model.

Return type:

None.
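
A typical call, followed by saving results (the column name and topic count are illustrative):

    tm.lda(num_topics={"Narrative": 25}, training_iterations=1000, iteration_step=10)
    tm.save_lda_results()

    # Or omit num_topics to run lda optimization up to a maximum number of topics:
    tm.lda(max_topics=50)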

lda_extract_models(file_path)

Loads lda models from file.

Parameters:

file_path (str) – path to file

Return type:

None.

lda_visual(col)

Saves pyLDAvis output from lda to file.

Parameters:

col (str) – reference to column of interest

Return type:

None.

load_bert_model(file_path, reduced=False, from_probs=False, thresh=0.01)

Loads trained bertopic model(s).

Parameters:
  • file_path (string) – file path to the folder storing the model(s)

  • reduced (bool, optional) – True if the model is a reduced topic model, by default False

  • from_probs (bool, optional) – Whether or not to use document topic probabilities to assign topics. True to use probabilities - i.e., each document can have multiple topics. False to not use probabilities - i.e., each document only has one topic, by default False

  • thresh (float, optional) – probability threshold used when from_probs=True, by default 0.01

reduce_bert_topics(num=30, from_probs=False, thresh=0.01)

Reduces the number of topics in a trained bertopic model to the specified number.

Parameters:
  • num (int, optional) – number of topics in the reduced model. The default is 30.

  • from_probs (boolean, optional) – true to assign topics to documents based on a probability threshold (i.e., documents can have multiple topics). The default is False.

  • thresh (float, optional) – probability threshold used when from_probs=True. The default is 0.01.

Return type:

None.

save_bert_coherence(return_df=False, coh_method='u_mass', from_probs=False)

Saves the coherence scores for a bertopic model.

Parameters:
  • return_df (boolean, optional) – True to return the coherence_df. The default is False.

  • coh_method (string, optional) – Method used to calculate coherence. Can be any method used in gensim. The default is ‘u_mass’.

  • from_probs (boolean, optional) – Whether or not to use document topic probabilities to assign topics. True to use probabilities - i.e., each document can have multiple topics. False to not use probabilities - i.e., each document only has one topic. The default is False.

Returns:

coherence_df – Dataframe with each row a topic, column has coherence scores.

Return type:

pandas DataFrame

save_bert_document_topic_distribution(return_df=False)

Saves the document topic distribution.

Parameters:

return_df (boolean, optional) – True to return the results df. The default is False.

Returns:

doc_df – dataframe with a row for each document and the probability for each topic

Return type:

pandas DataFrame

save_bert_model(embedding_model=True)

Saves a BERTopic model.

Parameters:

embedding_model (boolean, optional) – True to save the embedding model. The default is True.

Return type:

None.

save_bert_results(coherence=False, coh_method='u_mass', from_probs=False, thresh=0.01, topk=10)

Saves the taxonomy, coherence, and document topic distribution in one Excel file.

Parameters:
  • coherence (boolean, optional) – true to calculate coherence for the model and save the results. The default is False.

  • coh_method (string, optional) – Method used to calculate coherence. Can be any method used in gensim. The default is ‘u_mass’.

  • from_probs (boolean, optional) – true to assign topics to documents based on a probability threshold (i.e., documents can have multiple topics). The default is False.

  • thresh (float, optional) – probability threshold used when from_probs=True. The default is 0.01.

  • topk (int, optional) – Number of words per topic used to calculate diversity. The default is 10.

Return type:

None.

save_bert_taxonomy(return_df=False, p_thres=0.0001)

Saves a taxonomy of topics from bertopic model.

Parameters:
  • return_df (boolean, optional) – True to return the results dfs. The default is False.

  • p_thres (float, optional) – word-topic probability threshold required for a word to be considered in a topic. The default is 0.0001.

Returns:

taxonomy_df – taxonomy dataframe with a column for each text column and each row a unique combination of topics found in the documents

Return type:

pandas Dataframe

save_bert_topic_diversity(topk=10, return_df=False)

Saves topic diversity score for a bertopic model.

Parameters:
  • topk (int, optional) – Number of words per topic used to calculate diversity. The default is 10.

  • return_df (boolean, optional) – True to return the diversity_df. The default is False.

Returns:

diversity_df – Dataframe with topic diversity score.

Return type:

pandas DataFrame

save_bert_topics(return_df=False, p_thres=0.0001, coherence=False, coh_method='u_mass', from_probs=False)

Saves bert topics results to file.

Parameters:
  • return_df (boolean, optional) – True to return the results dfs. The default is False.

  • p_thres (float, optional) – word-topic probability threshold required for a word to be considered in a topic. The default is 0.0001.

  • coherence (boolean, optional) – true to calculate coherence for the model and save the results. The default is False.

  • coh_method (string, optional) – Method used to calculate coherence. Can be any method used in gensim. The default is ‘u_mass’.

  • from_probs (boolean, optional) – true to assign topics to documents based on a probability threshold (i.e., documents can have multiple topics). The default is False.

Returns:

dfs – dictionary of dataframes where each key is a text column and each value is the corresponding topic model results

Return type:

dictionary of dataframes

save_bert_topics_from_probs(thresh=0.01, return_df=False, coherence=False, coh_method='u_mass', from_probs=True)

Saves bertopic model results if using probability threshold.

Parameters:
  • thresh (float, optional) – probability threshold used when from_probs=True. The default is 0.01.

  • return_df (boolean, optional) – True to return the results dfs. The default is False.

  • coherence (boolean, optional) – true to calculate coherence for the model and save the results. The default is False.

  • coh_method (string, optional) – Method used to calculate coherence. Can be any method used in gensim. The default is ‘u_mass’.

  • from_probs (boolean, optional) – true to assign topics to documents based on a probability threshold (i.e., documents can have multiple topics). The default is True.

Returns:

topic_prob_dfs – dictionary of dataframes where each key is a text column and each value is the corresponding topic model results

Return type:

dictionary of dataframes

save_bert_vis()

Saves the bertopic visualization and hierarchy visualization.

Return type:

None.

save_hlda_coherence(return_df=False)

Saves hlda coherence to file.

Parameters:

return_df (boolean, optional) – True to return the results df. The default is False.

Returns:

coherence_df – Dataframe with each row a topic, column has coherence scores.

Return type:

pandas DataFrame

save_hlda_document_topic_distribution(return_df=False)

Saves hlda document topic distribution to file.

Parameters:

return_df (boolean, optional) – True to return the results df. The default is False.

Returns:

doc_df – dataframe with a row for each document and the probability for each topic

Return type:

pandas DataFrame

save_hlda_level_n_taxonomy(lev=1, return_df=False)

Saves hlda taxonomy at level n.

Parameters:
  • lev (int, optional) – the level number to save. The default is 1.

  • return_df (boolean, optional) – True to return the results df. The default is False.

Returns:

taxonomy_level_df – Taxonomy dataframe with a column for each text column and each row a unique combination of topics found in the documents

Return type:

pandas Dataframe

save_hlda_models()

Saves hlda models to file.

Return type:

None.

save_hlda_results()

Saves the taxonomy, level 1 taxonomy, raw topics, coherence, and document topic distribution in one Excel file.

Return type:

None.

save_hlda_taxonomy(return_df=False, use_labels=False, num_words=10)

Saves hlda taxonomy to file.

Parameters:
  • return_df (boolean, optional) – True to return the results df. The default is False.

  • use_labels (boolean, optional) – True to use topic labels generated from tomotopy. The default is False.

  • num_words (int, optional) – Number of words to display in the taxonomy. The default is 10.

Returns:

taxonomy_df – Taxonomy dataframe with a column for each text column and each row a unique combination of topics found in the documents

Return type:

pandas Dataframe

save_hlda_topics(return_df=False, p_thres=0.001)

Saves hlda topics to file.

Parameters:
  • return_df (boolean, optional) – True to return the results df. The default is False.

  • p_thres (float, optional) – word-topic probability threshold required for a word to be considered in a topic. The default is 0.001.

Returns:

dfs – dictionary of dataframes where each key is a text column and each value is the corresponding topic model results

Return type:

dictionary of dataframes

save_lda_coherence(return_df=False)

Saves lda coherence to file or returns the dataframe to another function.

Parameters:

return_df (boolean, optional) – True to return the results df. The default is False.

Returns:

coherence_df – Dataframe with each row a topic, column has coherence scores.

Return type:

pandas DataFrame

save_lda_document_topic_distribution(return_df=False)

Saves lda document topic distribution to file or returns the dataframe to another function.

Parameters:

return_df (boolean, optional) – True to return the results df. The default is False.

Returns:

doc_df – dataframe with a row for each document and the probability for each topic

Return type:

pandas DataFrame

save_lda_models()

Saves lda models to file.

Return type:

None.

save_lda_results()

Saves the taxonomy, coherence, and document topic distribution in one Excel file.

Return type:

None.

save_lda_taxonomy(return_df=False, use_labels=False, num_words=10)

Saves lda taxonomy to file or returns the dataframe to another function.

Parameters:
  • return_df (boolean, optional) – True to return the results df. The default is False.

  • use_labels (boolean, optional) – True to use topic labels generated from tomotopy. The default is False.

  • num_words (int, optional) – Number of words to display in the taxonomy. The default is 10.

Returns:

taxonomy_df – Taxonomy dataframe with a column for each text column and each row a unique combination of topics found in the documents

Return type:

pandas Dataframe

save_lda_topics(return_df=False, p_thres=0.001)

Saves lda topics to file.

Parameters:
  • return_df (boolean, optional) – True to return the results df. The default is False.

  • p_thres (float, optional) – word-topic probability threshold required for a word to be considered in a topic. The default is 0.001.

Returns:

dfs – dictionary of dataframes where each key is a text column and each value is the corresponding topic model results

Return type:

dictionary of dataframes

save_mixed_taxonomy(use_labels=False)

Saves a custom mixed lda/hlda model taxonomy. lda and hlda must be run with the desired parameters first.

Parameters:

use_labels (boolean, optional) – True to use topic labels. The default is False.

Return type:

None.

mika.kd.NER

Description: Utility functions for training custom NER models.

mika.kd.NER.align_labels_with_tokens(labels, word_ids)

Aligns labels and tokens for model training. Adds special tokens to identify tokens not in the tokenizer and the beginning of new words.

Parameters:
  • labels (list) – List of NER labels where each label is a string

  • word_ids (list) – List of word ids generated from a tokenizer

Returns:

new_labels – List of new labels with special token labels added.

Return type:

list
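
A sketch of the alignment for one pre-split sentence (the tag set is hypothetical; bert-base-uncased stands in for any fast tokenizer):

    from transformers import AutoTokenizer
    from mika.kd.NER import align_labels_with_tokens

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    words = ["engine", "failed", "during", "takeoff"]
    labels = ["B-CAU", "I-CAU", "O", "O"]  # hypothetical NER tags, one per word
    encoding = tokenizer(words, is_split_into_words=True)
    new_labels = align_labels_with_tokens(labels, encoding.word_ids())
    # new_labels now carries special-token labels for [CLS]/[SEP] and subword pieces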

mika.kd.NER.build_confusion_matrix(labels, preds, pred_labels, label_list, save=False, savepath='')

Creates a confusion matrix for the multiclass NER classification.

Parameters:
  • labels (list) – true labels

  • preds (Numpy Array) – predictions output from model

  • pred_labels (List) – token labels corresponding to predictions. Used for removing special tokens

  • label_list (Dict) – Maps label ids to NER labels.

  • save (Boolean, optional) – True to save the figure as pdf, false to not save. The default is False.

  • savepath (string, optional) – Path to save the figure to. The default is “”.

Returns:

  • conf_mat (pandas DataFrame) – confusion matrix object

  • true_predictions (list) – List of model prediction labels

  • true_labels (list) – List of true labels

mika.kd.NER.check_doc_to_sentence_split(sentence_df)

Tests that tags are preserved during document to sentence split.

Parameters:

sentence_df (pandas DataFrame) – Dataframe where each row is a sentence. Has columns ‘sentence’ and ‘tags’

Return type:

None.

mika.kd.NER.clean_annots_from_str(df)

Cleans annotated documents by removing excess symbols. Use if the text is a string from a list of words.

Parameters:

df (pandas dataframe) – Dataframe of the annotated document set.

Returns:

df – Dataframe of the cleaned annotated document set.

Return type:

pandas dataframe

mika.kd.NER.clean_doccano_annots(df)

Cleans annotated documents.

Parameters:

df (pandas dataframe) – Dataframe of the annotated document set.

Returns:

df – Dataframe of the cleaned annotated document set.

Return type:

pandas dataframe

mika.kd.NER.clean_text_tags(text, labels)

Cleans an individual text and label set by removing extra spaces or punctuation at the beginning or end of a string, fixing annotations that are missing a preceding or following character, and adding spaces into text that is missing a space.

Parameters:
  • text (string) – text associated with the labels

  • labels (list) – list of labels from annotation where each label in the list is a tuple with structure (1,2,3), where 1 is the beginning character location, 2 is the end character location, and 3 is the label name

Returns:

  • new_labels (list) – cleaned list of labels

  • text (string) – cleaned string corresponding to the labels.

mika.kd.NER.compute_classification_report(labels, preds, pred_labels, label_list)

Computes classification report.

Parameters:
  • labels (list) – true labels

  • preds (Numpy Array) – predictions output from model

  • pred_labels (List) – token labels corresponding to predictions. Used for removing special tokens

  • label_list (Dict) – Maps label ids to NER labels.

Returns:

Output of classification results

Return type:

Classification Report

mika.kd.NER.compute_metrics(eval_preds, id2label)

Calculates sequence classification metrics. Used during training.

Parameters:
  • eval_preds (Tensor) – Pytorch tensor generated during training/evaluation.

  • id2label (Dict) – Dict mapping numeric ids to labels.

Return type:

Dict of metrics.

mika.kd.NER.get_cleaned_label(label)

Removes the BILOU component of a tag.

Parameters:

label (string) – NER label potentially with BILOU tag.

Returns:

label – Label with BILOU component removed.

Return type:

string

mika.kd.NER.identify_bad_annotations(text_df)

Used for identifying invalid annotations, which typically occur from inaccurate tagging, for example, tagging an extra space or punctuation. Must be run prior to any model training or training will fail on bad annotations.

Parameters:

text_df (pandas DataFrame) – Dataframe storing the documents. Must contain ‘docs’ column and a ‘tags’ column.

Returns:

bad_tokens – list of tokens that are invalid.

Return type:

list

mika.kd.NER.plot_eval_metrics(eval_df, save, savepath)

Plots classification metrics on the evaluation set over training.

Parameters:
  • eval_df (pandas DataFrame) – Dataframe of evaluation metrics

  • save (Boolean, optional) – True to save the figure as pdf, false to not save. The default is False.

  • savepath (string, optional) – Path to save the figure to. The default is “”.

Return type:

None.

mika.kd.NER.plot_eval_results(filepath, final_train_metrics={}, final_eval_metrics={}, save=False, savepath=None, loss=True, metrics=True)

Plots evaluation and training metrics as specified using other functions.

Parameters:
  • filepath (string) – Path to the checkpoint logs.

  • final_train_metrics (Dict, optional) – Dictionary of final training metrics in case they are not recorded in the log. The default is {}.

  • final_eval_metrics (Dict, optional) – Dictionary of final eval metrics in case they are not recorded in the log. The default is {}.

  • save (Boolean, optional) – True to save the figure as pdf, false to not save. The default is False.

  • savepath (string, optional) – Path to save the figure to. The default is None.

  • loss (Boolean, optional) – True to plot loss False to not plot loss. The default is True.

  • metrics (Boolean, optional) – True to plot evaluation metrics. The default is True.

Return type:

None.
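
For example (the paths are hypothetical):

    from mika.kd.NER import plot_eval_results

    plot_eval_results("results/checkpoint-500",  # path to checkpoint logs
                      save=True, savepath="figures/",
                      loss=True, metrics=True)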

mika.kd.NER.plot_loss(eval_df, training_df, save, savepath)

Plots the validation and training loss over training.

Parameters:
  • eval_df (pandas DataFrame) – Dataframe of evaluation metrics

  • training_df (pandas DataFrame) – Dataframe of training metrics

  • save (Boolean, optional) – True to save the figure as pdf, false to not save. The default is False.

  • savepath (string, optional) – Path to save the figure to. The default is “”.

Return type:

None.

mika.kd.NER.read_doccano_annots(file, encoding=False)

Reads in a .jsonl file containing annotations from doccano.

Parameters:
  • file (string) – File location.

  • encoding (boolean, optional) – True if the file should be opened with utf-8 encoding. The default is False.

Returns:

df – Dataframe of the annotated document set.

Return type:

pandas dataframe

mika.kd.NER.read_trainer_logs(filepath, final_train_metrics, final_eval_metrics)

Reads training logs stored at a checkpoint to get training metrics.

Parameters:
  • filepath (string) – Path to the checkpoint logs.

  • final_train_metrics (Dict) – Dictionary of final training metrics in case they are not recorded in the log

  • final_eval_metrics (Dict) – Dictionary of final eval metrics in case they are not recorded in the log

Returns:

  • eval_df (pandas DataFrame) – Dataframe of evaluation metrics

  • training_df (pandas DataFrame) – Dataframe of training metrics

mika.kd.NER.split_docs_to_sentences(text_df, id_col='Tracking #', tags=True)

Splits a dataframe of documents into a new dataframe where each sentence from each document is in its own row. This is useful for using BERT models with maximum character limits.

Parameters:
  • text_df (pandas DataFrame) – Dataframe storing the documents. Must contain ‘docs’ column and an id column.

  • id_col (string, optional) – The column in the dataframe storing the document ids. The default is ‘Tracking #’.

  • tags (boolean, optional) – True if the dataframe also contains tags or annotations. False if no annotations are present. The default is True.

Returns:

sentence_df – New dataframe where each row is a sentence.

Return type:

pandas DataFrame
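
A small sketch with unannotated documents:

    import pandas as pd
    from mika.kd.NER import split_docs_to_sentences

    text_df = pd.DataFrame({
        "Tracking #": ["R1", "R2"],
        "docs": ["Engine failed on climb out. The pilot returned to the field.",
                 "Rotor vibration was reported during hover."]})
    sentence_df = split_docs_to_sentences(text_df, id_col="Tracking #", tags=False)
    # sentence_df has one row per sentence, keyed by the originating document id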

mika.kd.NER.tokenize(sentence_df, tokenizer)

Tokenizer used during dataset creation when the input data has already been tokenized. Currently unused; can be used to get token ids.

Parameters:
  • sentence_df (pandas DataFrame) – Dataframe where each row is a sentence. Has columns ‘sentence’ and ‘tags’

  • tokenizer (tokenizer object) – tokenizer object, usually an AutoTokenizer for a Transformers BERT model

Returns:

tokenized_inputs – list of tokens with corresponding IDs from the tokenizer

Return type:

list

mika.kd.NER.tokenize_and_align_labels(sentence_df, tokenizer, align_labels=True)

Tokenizes text and aligns with labels. Necessary due to subword tokenization with BERT.

Parameters:
  • sentence_df (pandas DataFrame) – Dataframe where each row is a sentence. Has columns ‘sentence’ and ‘tags’

  • tokenizer (tokenizer object) – tokenizer object, usually an AutoTokenizer for a Transformers BERT model

  • align_labels (Boolean, optional) – True if labels should be aligned. The default is True.

Returns:

tokenized_inputs – Tokenized inputs with new labels aligned to the corresponding tokens

Return type:

Object

mika.kd.trend_analysis

Description: Trend analysis functions.

mika.kd.trend_analysis.add_hazards_to_docs(preprocessed_df, id_field, docs)

Add hazards to documents

Add hazard values to preprocessed_df for running chi-squared tests

Parameters:
  • preprocessed_df (pandas DataFrame) – pandas dataframe containing documents.

  • id_field (string) – the column in preprocessed df that contains document ids

  • docs (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.

Returns:

preprocessed_df – pandas dataframe containing documents.

Return type:

pandas DataFrame

mika.kd.trend_analysis.bootstrap_metric(metric_data, time_vals, metric_percentages, num_means=1000, CI_interval=95)

Bootstrap metric

Performs bootstrapping to better estimate the true metric value given a metric percentage (i.e., hazard extraction accuracy).

Parameters:
  • metric_data (dict) – nested dict where keys are hazards. inner dict has time value (usually years) as keys and list of metrics for values.

  • time_vals (list) – list of time values in the time series. generated in plot metric time series.

  • metric_percentages (dictionary) – dict with hazards as keys and values as the percentage to be sampled for each bootstrap. This can be input as the hazard extraction accuracy to try to better capture the true population mean.

  • num_means (int, optional) – the number of means calculated via bootstrapping, by default 1000

  • CI_interval (int, optional) – level of confidence for the interval, by default 95

Returns:

  • averages (dictionary) – dictionary with hazards as keys and the value is a list of metric averages with one average per year in the time series.

  • CI (dictionary) – dictionary with hazards as keys and value is a (2,n) array where n is the number of years of data.

mika.kd.trend_analysis.build_word_clouds(word_frequencies, nrows, ncols, figsize=(8, 4), cmap=None, save=False, save_path=None, fontsize=10, wordcloud_kwargs={})

Word clouds

Builds a word cloud for each hazard.

Parameters:
  • word_frequencies (dictionary) – nested dictionary where keys are hazards. inner dictionary has words as keys and word frequencies as values.

  • nrows (int) – number of rows in the grid of word clouds

  • ncols (int) – number of columns in the grid of word clouds

  • figsize (tuple, optional) – figure size in inches. The default is (8, 4).

  • cmap (matplotlib colormap, optional) – colormap object used for coloring the word clouds. The default is None.

  • save (boolean, optional) – true to save figure. The default is False.

  • save_path (string, optional) – path to save figure to. The default is None.

  • fontsize (int, optional) – fontsize for title and minimum fontsize in wordcloud. The default is 10.

  • wordcloud_kwargs (dict, optional) – optional keyword args to pass into the wordcloud object. The default is {}.

Return type:

None
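
For example, with toy frequencies for two hazards:

    from mika.kd.trend_analysis import build_word_clouds

    word_frequencies = {
        "Engine Failure": {"engine": 12, "power": 7, "failure": 5},
        "Bird Strike": {"bird": 9, "strike": 8, "windshield": 3}}
    build_word_clouds(word_frequencies, nrows=1, ncols=2, figsize=(8, 4), save=False)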

mika.kd.trend_analysis.calc_CI(bootstrapped_means, CI_interval=95)

Calculate confidence interval

Calculates a confidence interval from a list of means generated using bootstrapping.

Parameters:
  • bootstrapped_means (dictionary) – nested dictionary with hazards as keys, inner dictionary with years as keys and a list of bootstrapped means as the value.

  • CI_interval (int, optional) – level of confidence for the interval, by default 95

Returns:

CI – dictionary with hazards as keys and value is a (2,n) array where n is the number of years of data.

Return type:

dictionary

mika.kd.trend_analysis.calc_classification_metrics(labeled_file, docs_per_hazard, id_col)

Calculate classification metrics

Calculates classification metrics where the true labels come from a file and the predicted labels are from the hazard extraction process.

Parameters:
  • labeled_file (string) – location the labeled file is saved at

  • docs_per_hazard (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.

  • id_col (string) – column in preprocessed_df containing document ids

Returns:

  • metrics_df (pandas DataFrame) – dataframe with recall, precision, f1, accuracy, and support for each hazard

  • labeled_docs (pandas DataFrame) – dataframe of the manually labeled docs

  • HEAT_labeled_docs (pandas DataFrame) – dataframe of the HEAT labeled docs

mika.kd.trend_analysis.calc_rate(frequency)

Calculate rate of occurrence of hazard

Calculates the average rate of occurrence for a hazard from the frequency per year.

Parameters:

frequency (Dict) – Nested dictionary used to store hazard frequencies. Keys are hazards and inner dict keys are years, values are ints.

Returns:

rates – dictionary with hazard name as keys and a rate as a value

Return type:

dict

mika.kd.trend_analysis.calc_severity_per_hazard(docs, df, id_field, metric='average')

Calculate severity per hazard

Used to calculate the severity for each hazard occurrence, as well as the average severity.

Parameters:
  • docs (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.

  • df (pandas DataFrame) – pandas dataframe containing documents with severity per document already calculated

  • id_field (string) – the column in df that contains document ids

  • metric (string, optional) – whether to calculate the average or maximum severity, default is ‘average’

Returns:

  • severities (Dict) – nested dictionary used to store severities per hazard occurrence. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists of severities.

  • total_severities_hazard (Dict) – Dictionary used to store average severity per hazard. Keys are hazards and value is average severity.

mika.kd.trend_analysis.check_for_hazard_words(h_word, text)

Check for hazard words

Checks to see if a section of text contains a hazard word.

Parameters:
  • h_word (string) – hazard word

  • text (string) – section of text being searched for hazard word.

Returns:

hazard_found – true if hazard word appears in text, false if not.

Return type:

boolean

mika.kd.trend_analysis.check_for_negation_words(negation_words, text, h_word)

Check for negation words

Checks to see if any negation words appear within 3 words of a hazard word in a specified section of text.

Parameters:
  • negation_words (list) – list of negation words.

  • text (string) – section of text being searched for hazard and negation words.

  • h_word (string) – hazard word

Returns:

hazard_found – true if the hazard word appears in the text with no negation words, false if not (i.e., a negation word is present).

Return type:

boolean
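
A small sketch combining the two checks (the negation list is illustrative):

    from mika.kd.trend_analysis import check_for_hazard_words, check_for_negation_words

    text = "no engine fire was observed after landing"
    if check_for_hazard_words("fire", text):
        found = check_for_negation_words(["no", "not", "without"], text, "fire")
        # found is expected to be False: "no" falls within 3 words of "fire"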

mika.kd.trend_analysis.chi_squared_tests(preprocessed_df, hazards, predictors, pred_dict={})

Chi squared tests

Performs chi-squared test for each predictor to determine if there is a statistically significant difference in the counts of the predictor between reports with and without each hazard

Parameters:
  • preprocessed_df (pandas DataFrame) – pandas dataframe containing documents.

  • hazards (list) – list of hazards

  • predictors (list) – list of columns in the dataframe with categorical predictors of interest.

  • pred_dict (dict, optional) – dictionary with predictors from the predictor list as keys and names to display in the table as values. The default is {}.

Returns:

  • stats_df (pandas DataFrame) – pandas dataframe containing the chi-squared statistic and p-val for each hazard-predictor pair

  • count_dfs (Dict) – dictionary of pandas dataframes containing the counts for each predictor and hazard
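
A sketch, assuming preprocessed_df already has hazard indicator columns added via add_hazards_to_docs and that the predictor columns exist in the data:

    from mika.kd.trend_analysis import chi_squared_tests

    stats_df, count_dfs = chi_squared_tests(
        preprocessed_df,
        hazards=["Engine Failure", "Bird Strike"],
        predictors=["Mission Type", "Aircraft Type"],  # hypothetical columns
        pred_dict={"Mission Type": "Mission", "Aircraft Type": "Aircraft"})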

mika.kd.trend_analysis.corr_sig(df=None)

Correlation significance

Returns the probability matrix for a correlation matrix.

Parameters:

df (Pandas Dataframe, optional) – df storing the data used to create the correlation matrix; rows are years, columns are predictors plus hazard frequencies. The default is None.

Returns:

p_matrix – numpy array with p_values for corresponding dataframe entries

Return type:

numpy array

mika.kd.trend_analysis.create_correlation_matrix(predictors_scaled, frequencies_scaled, graph=True, mask_vals=False, figsize=(6, 4), fontsize=12, save=False, results_path='', title=False)

Create correlation matrix

Creates the correlation matrix between all predictors and all hazard frequencies. All arguments are outputs from create_metrics_time_series.

Parameters:
  • predictors_scaled (dict) – dictionary with keys as predictor names, values as a time series list of values scaled using minmax

  • frequencies_scaled (dict) – dictionary with keys as hazard names, values as times series list of frequencies scaled using minmax

  • graph (boolean, optional) – True to graph the data, false to not graph. The default is True.

  • mask_vals (boolean, optional) – True to mask values that are not significant. The default is False.

  • figsize (tuple, optional) – size of the plot in inches. The default is (6,4).

  • fontsize (int, optional) – fontsize for the plot. The default is 12.

  • save (boolean, optional) – True to save the graph. The default is False.

  • results_path (string, optional) – path to save the plot to. The default is “”.

  • title (Boolean, optional) – True to show a title on graph. The default is False.

Returns:

  • corrMatrix (pandas DataFrame) – correlation matrix.

  • correlation_mat_total (pandas DataFrame) – stores the hazard and predictor values used for the correlation matrix

  • p_values (pandas DataFrame) – p-vals for each correlation.

mika.kd.trend_analysis.examine_hazard_extraction_mismatches(preprocessed_df, true, pred, hazards, hazard_words_per_doc, topics_per_doc, hazard_topics_per_doc, id_col, text_col, results_path)

Examine hazard extraction mismatches

Used to examine which documents are mislabeled by HEAT. Used iteratively to refine hazard extraction.

Parameters:
  • preprocessed_df (pandas DataFrame) – pandas dataframe containing documents.

  • true (pandas DataFrame) – dataframe of the manually labled docs

  • pred (pandas DataFrame) – dataframe of the HEAT labled docs

  • hazards (list) – list of hazards

  • hazard_words_per_doc (Dict) – used to store the hazard words per document. keys are hazards and values are lists with an element for each document.

  • topics_per_doc (dict) – dictionary with keys as document ids and values as a list of topic numbers

  • hazard_topics_per_doc (dict) – nested dictionary with keys as document ids. inner dictionary has hazard names as keys and values as a list of topics

  • id_col (string) – column in preprocessed_df containing document ids

  • text_col (string) – column in preprocessed_df containing text

  • results_path (string) – location to save the resulting datasheets to

Returns:

dfs – dictionary with keys as hazards and values as pandas dataframes storing document mismatches

Return type:

dict

mika.kd.trend_analysis.get_doc_text(id_, temp_df, id_field, text_field)

Get document text

Gets the text for a document.

Parameters:
  • id_ (string) – id of the specified document

  • temp_df (pandas DataFrame) – subset of preprocessed_df only containing documents associated with the specified hazard

  • id_field (string) – the column in preprocessed df that contains document ids

  • text_field (string) – the column in preprocessed df that stores the text. can be different from results_text_field, but is usually the same

Returns:

text – the document text

Return type:

string

mika.kd.trend_analysis.get_doc_time(id_, temp_df, id_field, time_field)

Get document time

Gets the time value for a document, usually the year of the report, but it could also be a month or any other time value.

Parameters:
  • id_ (string) – id of the specified document

  • temp_df (pandas DataFrame) – subset of preprocessed_df only containing documents associated with the specified hazard

  • id_field (string) – the column in preprocessed df that contains document ids

  • time_field (string) – the column in preprocessed df that contains document time values, such as report year

Returns:

year – the time value for the specified document, usually a year

Return type:

str

mika.kd.trend_analysis.get_hazard_df(hazard_info, hazards, i)

Get hazard dataframe

Gets hazard information for a specified hazard.

Parameters:
  • hazard_info (pandas DataFrame) – pandas dataframe with columns for hazard names, hazard words, and topics per hazard

  • hazards (list) – list of hazard names

  • i (int) – the index of the specified hazard in hazard list.

Returns:

  • hazard_df (pandas DataFrame) – dataframe containing the information (topics, words, etc) for the specified hazard

  • hazard_name (string) – name of the specified hazard

mika.kd.trend_analysis.get_hazard_doc_ids(nums, results, results_text_field, docs, doc_topic_distribution, text_field, topic_thresh, preprocessed_df, id_field)

Get hazard document IDs

Gets the document ids associated with a specified hazard.

Parameters:
  • nums (list) – list of topic numbers associated with the hazard

  • results (pandas DataFrame) – dataframe with topic modeling results generated using topic model plus

  • results_text_field (string) – column in result dataframe where topic numbers are stored for the specified text column

  • docs (list) – list of document ids.

  • doc_topic_distribution (pandas DataFrame) – dataframe containing topic distributions per document

  • text_field (string) – the column in preprocessed df that stores the text. can be different from results_text_field, but is usually the same

  • topic_thresh (float) – the probability threshold a document must have to be considered in a topic

  • preprocessed_df (pandas DataFrame) – pandas dataframe containing documents

  • id_field (string) – the column in preprocessed df that contains document ids

Returns:

  • temp_df (pandas DataFrame) – subset of preprocessed_df only containing documents associated with the specified hazard

  • ids_ (list) – list of document ids associated with the specified hazard

mika.kd.trend_analysis.get_hazard_info(hazard_file)

Get hazard information

Loads hazard information from hazard spreadsheet.

Parameters:

hazard_file (string) – filepath to hazard interpretation spreadsheet

Returns:

  • hazard_info (pandas DataFrame) – pandas dataframe with columns for hazard names, hazard words, and topics per hazard

  • hazards (list) – list of hazard names

mika.kd.trend_analysis.get_hazard_topics(hazard_df, begin_nums)

Get hazard topics

Gets topic numbers for a hazard from the hazard dataframe.

Parameters:
  • hazard_df (pandas DataFrame) – dataframe containing the information (topics, words, etc) for the specified hazard

  • begin_nums (int) – The topic index to begin at. Is -1 if using bertopic, since the top-level ‘topic’ is really the cluster of documents not belonging to any topic; 0 otherwise.

Returns:

nums – list of topic numbers associated with the hazard

Return type:

list

mika.kd.trend_analysis.get_hazard_topics_per_doc(ids, topics_per_doc, hazard_topics_per_doc, hazard_name, nums, begin_nums)

Get hazard topics per document

Gets the topics per document that are associated with a specified hazard.

Parameters:
  • ids (list) – list of document ids associated with the specified hazard

  • topics_per_doc (dict) – dictionary with keys as document ids and values as a list of topic numbers

  • hazard_topics_per_doc (dict) – nested dictionary with keys as document ids. Inner dictionary has hazard names as keys and values as a list of topics. Inner dictionary values are all empty lists for the input variable.

  • hazard_name (string) – name of specified hazard

  • nums (list) – list of topic numbers associated with the hazard

  • begin_nums (int) – The topic index to begin at. Is -1 if using bertopic, since the top-level ‘topic’ is really the cluster of documents not belonging to any topic; 0 otherwise.

Returns:

hazard_topics_per_doc – nested dictionary with keys as document ids. inner dictionary has hazard names as keys and values as a list of topics

Return type:

dict

mika.kd.trend_analysis.get_hazard_words(hazard_df)

Get hazard words

Gets the hazard words for a specified hazard dataframe.

Parameters:

hazard_df (pandas DataFrame) – dataframe containing the information (topics, words, etc) for the specified hazard

Returns:

hazard_words – list of hazard words

Return type:

list

mika.kd.trend_analysis.get_likelihood_FAA(rates)

FAA likelihood

Converts hazard rate of occurrence to an FAA likelihood category.

Parameters:

rates (dict) – dictionary with hazard name as keys and a rate as a value

Returns:

curr_likelihoods – dictionary with hazard name as keys and a likelihood category as a value

Return type:

dict
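
For example, going from yearly counts to FAA likelihood categories (the counts are illustrative):

    from mika.kd.trend_analysis import calc_rate, get_likelihood_FAA

    frequency = {"Engine Failure": {2018: 4, 2019: 6, 2020: 5}}
    rates = calc_rate(frequency)             # average rate of occurrence per hazard
    likelihoods = get_likelihood_FAA(rates)  # hazard name -> FAA likelihood category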

mika.kd.trend_analysis.get_likelihood_USFS(rates)

USFS likelihood

Converts a hazard rate of occurrence to a USFS likelihood category.

Parameters:

rates (dict) – dictionary with hazard name as keys and a rate as a value

Returns:

curr_likelihoods – dictionary with hazard name as keys and a likelihood category as a value

Return type:

dict

mika.kd.trend_analysis.get_negation_words(hazard_df)

Get negation words

Gets the negation words for a specified hazard dataframe

Parameters:

hazard_df (pandas DataFrame) – dataframe containing the information (topics, words, etc) for the specified hazard

Returns:

negation_words – list of negation words

Return type:

list

mika.kd.trend_analysis.get_results_info(results_file, results_text_field, text_field, doc_topic_dist_field)

Get results information

Pulls topic modeling results from results spreadsheet generated with topic model plus.

Parameters:
  • results_file (string) – filepath to results spreadsheet

  • results_text_field (string) – column in result dataframe where topic numbers are stored for the specified text column

  • text_field (string) – the text field of interest in the preprocessed_df. It is occasionally different from results_text_field, but usually the same.

  • doc_topic_dist_field (string or None) – the column storing the topic distribution per document. Can be omitted; it is only used when a user wants to filter results so that a document belongs to a topic only if the probability is above a specified threshold.

Returns:

  • results (pandas DataFrame) – dataframe with topic modeling results generated using topic model plus

  • results_text_field (string) – column in result dataframe where topic numbers are stored for the specified text column

  • doc_topic_distribution (pandas DataFrame) – dataframe containing topic distributions per document

  • begin_nums (int) – The topic index to begin at. It is -1 when using BERTopic, since the top-level ‘topic’ is really the cluster of documents not belonging to any topic; 0 otherwise.

mika.kd.trend_analysis.get_topics_per_doc(docs, results, results_text_field, hazards)

Get topics per document

Finds the topics associated with each document.

Parameters:
  • docs (list) – list of document ids.

  • results (pandas DataFrame) – dataframe with topic modeling results generated using topic model plus

  • results_text_field (string) – column in result dataframe where topic numbers are stored for the specified text column

  • hazards (list) – list of hazards

Returns:

  • topics_per_doc (dict) – dictionary with keys as document ids and values as a list of topic numbers

  • hazard_topics_per_doc (dict) – nested dictionary with keys as document ids. inner dictionary has hazard names as keys and values as a list of topics

mika.kd.trend_analysis.get_word_frequencies(hazard_words_per_doc, hazards_sorted=None)

Word frequencies

Calculate word frequencies.

Parameters:
  • hazard_words_per_doc (Dict) – used to store the hazard words per document. keys are hazards and values are lists with an element for each document.

  • hazards_sorted (list, optional) – ordered list of hazards for generating frequencies. The default is None.

Returns:

word_frequencies – nested dictionary where keys are hazards. inner dictionary has words as keys and word frequencies as values.

Return type:

dictionary
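
A sketch of the expected input shape and the call (the hazard name and words are illustrative):

    from mika.kd.trend_analysis import get_word_frequencies

    # one list of extracted hazard words per document, keyed by hazard
    hazard_words_per_doc = {
        "Engine Failure": [["engine", "fire"], ["engine"], []],
    }
    word_frequencies = get_word_frequencies(hazard_words_per_doc)
    # word_frequencies["Engine Failure"] maps each word (e.g., "engine")
    # to how often it occurs across the documents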

mika.kd.trend_analysis.hazard_accuracy(docs_per_hazard, num, results_path, hazard_words_per_doc, preprocessed_df, text_col, id_col, seed=0)

Hazard accuracy

Creates a data sheet for calculating hazard extraction accuracy by randomly sampling documents for each hazard. This method is actually calculating precision at k, with k = num: for example, if num is 10 and 8 of the sampled documents truly contain the hazard, precision at 10 is 0.8. Note that this is not the preferred way to evaluate hazard accuracy; instead, a user should label a validation set and use classification metrics. A sketch of the precision at k computation is shown below.

Parameters:
  • docs_per_hazard (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.

  • num (int) – number of documents to sample for each hazard.

  • results_path (string) – filepath to topic model results spreadsheet

  • hazard_words_per_doc (Dict) – used to store the hazard words per document. keys are hazards and values are lists with an element for each document.

  • preprocessed_df (pandas DataFrame) – pandas dataframe containing documents

  • text_col (string) – column in preprocessed_df containing text

  • id_col (string) – column in preprocessed_df containing document ids

  • seed (int, optional) – seed for random sampling. The default is 0.

Returns:

  • sampled_hazard_ids (dict) – dictionary with hazards as keys and a list of document ids for values

  • total_ids (list) – list of all the ids of documents belonging to any hazard
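
A minimal sketch of the precision at k computation once the sampled sheet has been manually reviewed (the relevance labels here are hypothetical):

    # 1 if a sampled document truly contains the hazard, 0 otherwise (num = 10)
    reviews = {"Engine Failure": [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]}
    precision_at_k = {h: sum(labels) / len(labels) for h, labels in reviews.items()}
    # {"Engine Failure": 0.8}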

mika.kd.trend_analysis.identify_docs_per_fmea_row(df, grouping_col, year_col, id_col)

Identify documents per FMEA row

Identifies the documents corresponding to each row in an FMEA.

Parameters:
  • df (pandas DataFrame) – pandas dataframe containing documents.

  • grouping_col (string) – the column in df that is used to group documents into FMEA rows

  • year_col (string) – the column in df that contains document time values, such as report year

  • id_col (string) – the column in df that contains document ids

Returns:

  • frequency (Dict) – Nested dictionary used to store hazard frequencies. Keys are hazards and inner dict keys are years, values are ints.

  • docs_per_row (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.

mika.kd.trend_analysis.identify_docs_per_hazard(hazard_file, preprocessed_df, results_file, text_field, time_field, id_field, results_text_field=None, doc_topic_dist_field=None, topic_thresh=0.0)

Identify documents per hazard

Outputs the documents per hazard.

Parameters:
  • hazard_file (string) – filepath to hazard interpretation spreadsheet

  • preprocessed_df (pandas DataFrame) – pandas dataframe containing documents.

  • results_file (string) – filepath to results spreadsheet

  • text_field (string) – the text field of interest in the preprocessed_df. It is occasionally different from results_text_field, but usually the same.

  • time_field (string) – the column in preprocessed df that contains document time values, such as report year

  • id_field (string) – the column in preprocessed df that contains document ids

  • results_text_field (string, optional) – column in result dataframe where topic numbers are stored for the specified text column. The default is None.

  • doc_topic_dist_field (string or None, optional) – the column storing the topic distribution per document. Can be omitted; it is only used when a user wants to filter results so that a document belongs to a topic only if the probability is above a specified threshold. The default is None.

  • topic_thresh (float, optional) – the probability threshold a document must have to be considered in a topic. The default is 0.0.

Returns:

  • frequency (Dict) – Nested dictionary used to store hazard frequencies. Keys are hazards and inner dict keys are years, values are ints.

  • docs_per_hazard (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.

  • hazard_words_per_doc (Dict) – used to store the hazard words per document. keys are hazards and values are lists with an element for each document.

  • topics_per_doc (dict) – dictionary with keys as document ids and values as a list of topic numbers

  • hazard_topics_per_doc (dict) – nested dictionary with keys as document ids. inner dictionary has hazard names as keys and values as a list of topics
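
A usage sketch of this pipeline call (the file paths and column names are illustrative):

    import pandas as pd
    from mika.kd.trend_analysis import identify_docs_per_hazard

    preprocessed_df = pd.read_csv("preprocessed_reports.csv")  # hypothetical path
    (frequency, docs_per_hazard, hazard_words_per_doc,
     topics_per_doc, hazard_topics_per_doc) = identify_docs_per_hazard(
        hazard_file="hazard_interpretation.xlsx",
        preprocessed_df=preprocessed_df,
        results_file="topic_model_results.xlsx",
        text_field="Narrative",
        time_field="Year",
        id_field="Report ID",
        topic_thresh=0.2,  # keep a document in a topic only above this probability
    )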

mika.kd.trend_analysis.make_pie_chart(docs, data, predictor, hazards, id_field, predictor_label=None, save=True, results_path='', pie_kwargs={}, fontsize=16, figsize=(17, 9), padding=5, legend_kwargs={})

Make pie chart

Makes a set of pie charts, with one pie chart per hazard showing the distribution of the categorical predictor variable specified.

Parameters:
  • docs (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.

  • data (pandas DataFrame) – pandas dataframe containing documents.

  • predictor (string) – column in the data that has the categorical predictor of interest

  • hazards (list) – list of hazards

  • id_field (string) – the column in preprocessed df that contains document ids

  • predictor_label (string, optional) – predictor label to be shown in the figure title, by default None

  • save (bool, optional) – True to save the figure, by default True

  • results_path (string, optional) – path to save figure to. The default is “”.

  • pie_kwargs (Dict, optional) – Dictionary to pass kwargs into the pie chart. The default is an empty dictionary.

  • fontsize (int, optional) – fontsize for the plot. The default is 16.

  • figsize (tuple, optional) – size of the figure in inches. The default is (17, 9).

  • padding (int, optional) – the padding between graphs. The default is 5.

  • legend_kwargs (dict, optional) – dictionary to pass in legend options
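
A usage sketch, assuming docs_per_hazard and preprocessed_df come from identify_docs_per_hazard above (the predictor column name is hypothetical):

    from mika.kd.trend_analysis import make_pie_chart

    make_pie_chart(
        docs=docs_per_hazard,
        data=preprocessed_df,
        predictor="Mission Type",  # hypothetical categorical column
        hazards=["Engine Failure", "Bird Strike"],
        id_field="Report ID",
        predictor_label="Mission Type",
        save=True,
        results_path="results/",
    )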

mika.kd.trend_analysis.minmax_scale(data_list)

Minmax scale

Performs min-max scaling on a single data list in order to normalize the data. Normalization is required prior to regression and machine learning, and it is also used for graphing multiple time series on the same axes.

Parameters:

data_list (list) – list of numerical data that will be scaled using minmax scaling

Returns:

scaled_list – list of data scaled to between 0 and 1

Return type:

list
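
Min-max scaling maps each value x to (x - min) / (max - min). A minimal sketch of that transform, assuming the list is not constant (the function's handling of edge cases may differ):

    data = [3.0, 7.0, 5.0]
    lo, hi = min(data), max(data)
    scaled = [(x - lo) / (hi - lo) for x in data]  # [0.0, 1.0, 0.5]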

mika.kd.trend_analysis.multiple_reg_feature_importance(predictors, hazards, correlation_mat_total, save=False, results_path='', r2_fontsize=10, r2_figsize=(3.5, 4), predictor_import_fontsize=10, predictor_import_figsize=(7, 4))

Multiple regression feature importance

Builds a multiple regression model for hazard frequency given the predictors. Also performs predictor importance analysis to identify which predictors are most relevant to each hazard's frequency. Predictors and hazards must be columns in correlation_mat_total.

Parameters:
  • predictors (list) – list of predictor names, used to identify inputs to multiple regression

  • hazards (list) – list of hazard names, used to identify targets for multiple regression

  • correlation_mat_total (dataframe) – stores the time series values that were used for correlation matrix. rows are years, columns are predictors + hazard frequencies

  • save (bool, optional) – True to save resulting figure, by default False

  • results_path (str, optional) – path to save the figure to, by default “”

  • r2_fontsize (int, optional) – Fontsize for the r2 figure, by default 10

  • r2_figsize (tuple, optional) – figsize for r2 figure, by default (3.5,4)

  • predictor_import_fontsize (int, optional) – fontsize for the predictor importance figure, by default 10

  • predictor_import_figsize (tuple, optional) – figsize for predictor importance figure, by default (7,4)

Returns:

  • results_df (Pandas Dataframe) – dataframe containing both the mse and r2 scores for the model

  • delta_df (Pandas Dataframe) – dataframe containing the difference between the total model and the model when the predictor is shuffled

  • coefficient_df (Pandas Dataframe) – dataframe containing the regression coefficients for predictor importance
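
A usage sketch (the predictor and hazard names are illustrative and must match columns in correlation_mat_total, the output of create_correlation_matrix):

    from mika.kd.trend_analysis import multiple_reg_feature_importance

    # correlation_mat_total: rows are years, columns are predictors + hazard frequencies
    results_df, delta_df, coefficient_df = multiple_reg_feature_importance(
        predictors=["Flight Hours", "Fleet Size"],
        hazards=["Engine Failure"],
        correlation_mat_total=correlation_mat_total,
        save=True,
        results_path="results/",
    )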

mika.kd.trend_analysis.plot_USFS_risk_matrix(likelihoods, severities, figsize=(9, 5), save=False, results_path='', fontsize=12, max_chars=20, title=False)

Plot USFS risk matrix

Plots a USFS risk matrix from likelihood and severity categories.

Parameters:
  • likelihoods (dict) – dictionary with hazard names as keys and likelihood categories as values

  • severities (dict) – dictionary with hazard names as keys and severity categories as values

  • figsize (tuple, optional) – figure size in inches. The default is (9,5).

  • save (boolean, optional) – true to save the figure. The default is False.

  • results_path (string, optional) – path to save figure to. The default is “”.

  • fontsize (int, optional) – figure fontsize. The default is 12.

  • max_chars (int, optional) – maximum characters per line in a cell of the risk matrix. used to improve readability and ensure hazard names are contained in a cell. The default is 20.

  • title (boolean, optional) – True to show the title. The default is False.

Return type:

None.

mika.kd.trend_analysis.plot_frequency_time_series(metric_data, metric_name='Frequency', line_styles=[], markers=[], title='', time_name='Year', xtick_freq=5, scale=True, save=False, results_path='', yscale=None, legend=True, figsize=(6, 4), fontsize=16, interval=False, interval_kwargs={'false_neg_rate': 0.05, 'false_pos_rate': 0.05}, legend_kwargs={})

Plot frequency time series

Plots hazard frequency over time. Differs from plot_metric_time_series in the format of its input data.

Parameters:
  • metric_data (dict) – nested dict where keys are hazards. inner dict has time value (usually years) as keys and frequency count as an integer for values.

  • metric_name (string, optional) – name of metric. The default is ‘Frequency’.

  • line_styles (list, optional) – list of line styles to use. should have one value for each hazard. The default is [].

  • markers (list, optional) – list of line markers to use. should have one value for each hazard. The default is [].

  • title (string, optional) – title to add to plot. The default is “”.

  • time_name (string, optional) – name of the time interval used to label the x axis. The default is “Year”.

  • xtick_freq (int, optional) – the number of values per x tick, e.g., the default would go 2015, 2020, 2025. The default is 5.

  • scale (boolean, optional) – true to minmax scale data, false to use raw data. The default is True.

  • save (boolean, optional) – true to save the plot as a pdf. The default is False.

  • results_path (string, optional) – path to save figure to. The default is “”.

  • yscale (string, optional) – yscale parameter, can be used to change scaling to log. The default is None.

  • legend (boolean, optional) – true to show legend, false to hide legend. The default is True.

  • figsize (tuple, optional) – size of the plot in inches. The default is (6,4).

  • fontsize (int, optional) – fontsize for the plot. The default is 16.

  • interval (bool, optional) – true to construct an interval for the metric based on false pos/neg rate. The default is False.

  • interval_kwargs (dict, optional) – dictionary to pass in the false positive and false negative rate for each hazard. The default is {‘false_pos_rate’:0.05, ‘false_neg_rate’:0.05}.

  • legend_kwargs (dict, optional) – dictionary to pass in legend options

Return type:

None.
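
A usage sketch, assuming frequency is the nested {hazard: {year: count}} dictionary returned by identify_docs_per_hazard:

    from mika.kd.trend_analysis import plot_frequency_time_series

    plot_frequency_time_series(
        frequency,
        metric_name="Frequency",
        time_name="Year",
        xtick_freq=5,
        scale=True,
        save=True,
        results_path="results/",  # hypothetical output directory
        interval=True,
        interval_kwargs={"false_pos_rate": 0.05, "false_neg_rate": 0.05},
    )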

mika.kd.trend_analysis.plot_metric_averages(metric_data, metric_name, show_std=True, title='', save=False, results_path='', yscale=None, legend=True, figsize=(6, 4), fontsize=16, error_bars='stddev')

Plot metric averages

Plots metric averages as a barchart.

Parameters:
  • metric_data (dict) – nested dict where keys are hazards. inner dict has time value (usually years) as keys and list of metrics for values.

  • metric_name (string) – name of metric, e.g., severity, used for axis and saving the figure

  • show_std (boolean, optional) – true to show std deviation on time series as error bars. The default is True.

  • title (string, optional) – title to add to plot. The default is “”.

  • save (boolean, optional) – true to save the plot as a pdf. The default is False.

  • results_path (string, optional) – path to save figure to. The default is “”.

  • yscale (string, optional) – yscale parameter, can be used to change scaling to log. The default is None.

  • legend (boolean, optional) – true to show legend, false to hide legend. The default is True.

  • figsize (tuple, optional) – size of the plot in inches. The default is (6,4).

  • fontsize (int, optional) – fontsize for the plot. The default is 16.

  • error_bars (string) – type of error bars to use, can be ‘stddev’ or ‘CI’. The default is ‘stddev’.

Return type:

None.

mika.kd.trend_analysis.plot_metric_time_series(metric_data, metric_name, line_styles=[], markers=[], title='', time_name='Year', scaled=False, xtick_freq=5, show_std=True, save=False, results_path='', yscale=None, legend=True, figsize=(6, 4), fontsize=16, bootstrap=False, bootstrap_kwargs={'CI_interval': 95, 'metric_percentages': 1, 'num_means': 1000}, legend_kwargs={})

Plot metric time series

Plots a time series for specified metrics for all hazards (i.e., line chart).

Parameters:
  • metric_data (dict) – nested dict where keys are hazards. inner dict has time value (usually years) as keys and list of metrics for values.

  • metric_name (string) – name of metric, e.g., severity, used for axis and saving the figure

  • line_styles (list, optional) – list of line styles to use. should have one value for each hazard. The default is [].

  • markers (list, optional) – list of line markers to use. should have one value for each hazard. The default is [].

  • title (string, optional) – title to add to plot. The default is “”.

  • time_name (string, optional) – name of the time interval used to label the x axis. The default is “Year”.

  • scaled (boolean, optional) – true to minmax scale data, false to use raw data. The default is False.

  • xtick_freq (int, optional) – the number of values per x tick, e.g., the default would go 2015, 2020, 2025. The default is 5.

  • show_std (boolean, optional) – true to show std deviation on time series as error bars. The default is True.

  • save (boolean, optional) – true to save the plot as a pdf. The default is False.

  • results_path (string, optional) – path to save figure to. The default is “”.

  • yscale (string, optional) – yscale parameter, can be used to change scaling to log. The default is None.

  • legend (boolean, optional) – true to show legend, false to hide legend. The default is True.

  • figsize (tuple, optional) – size of the plot in inches. The default is (6,4).

  • fontsize (int, optional) – fontsize for the plot. The default is 16.

  • bootstrap (Boolean, optional) – true to bootstrap the data for a confidence interval. The default is False.

  • bootstrap_kwargs (dict, optional) – dictionary to pass bootstrapping parameters

  • legend_kwargs (dict, optional) – dictionary to pass in legend options

Return type:

None.

mika.kd.trend_analysis.plot_predictors(predictors, predictor_labels, time, time_label='Year', title='', totals=True, averages=True, scaled=True, figsize=(12, 5), axs=[], fig=None, show=False, legend=True, legend_kwargs={})

Plot predictors

Plots predictor timeseries.

Parameters:
  • predictors (list) – list of predictors

  • predictor_labels (list) – list of labels for the predictors which will be shown on the graph

  • time (list) – list of time values that define the axis/time series.

  • time_label (string, optional) – label for the time values. The default is ‘Year’.

  • title (string, optional) – figure title. The default is “”.

  • totals (boolean, optional) – true to graph total or sum values. The default is True.

  • averages (boolean, optional) – true to graph average values. The default is True.

  • scaled (boolean, optional) – true to minmax scale the timeseries data. The default is True.

  • figsize (tuple, optional) – figure size in inches. The default is (12, 5).

  • axs (matplotlib axs object, optional) – used to plot multiple graphs on one figure. The default is [].

  • fig (matplotlib fig object, optional) – used to plot multiple graphs on one figure. The default is None.

  • show (boolean, optional) – true to show figure, false to return graph objects. The default is False.

  • legend (boolean, optional) – true to show legend. The default is True.

  • legend_kwargs (dict, optional) – dictionary to pass in legend options

Returns:

  • fig (matplotlib fig object) – the figure object

  • axs (matplotlib axs object) – axs object

mika.kd.trend_analysis.plot_risk_matrix(likelihoods, severities, figsize=(9, 5), save=False, results_path='', fontsize=12, max_chars=20, annot_font=12)

Plot risk matrix

Plots an FAA risk matrix from likelihood and severity categories.

Parameters:
  • likelihoods (dict) – dictionary with hazard names as keys and likelihood categories as values

  • severities (dict) – dictionary with hazard names as keys and severity categories as values

  • figsize (tuple, optional) – figure size in inches. The default is (9,5).

  • save (boolean, optional) – true to save the figure. The default is False.

  • results_path (string, optional) – path to save figure to. The default is “”.

  • fontsize (int, optional) – figure fontsize. The default is 12.

  • max_chars (int, optional) – maximum characters per line in a cell of the risk matrix. used to improve readability and ensure hazard names are contained in a cell. The default is 20.

  • annot_font (int, optional) – figure annotation fontsize. The default is 12.

Return type:

None.
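
A usage sketch pairing this with get_likelihood_FAA (the rate and severity values are hypothetical; severities would normally come from a severity analysis):

    from mika.kd.trend_analysis import get_likelihood_FAA, plot_risk_matrix

    rates = {"Engine Failure": 2.3}               # hypothetical rate of occurrence
    likelihoods = get_likelihood_FAA(rates)       # hazard -> likelihood category
    severities = {"Engine Failure": "Hazardous"}  # hypothetical severity category
    plot_risk_matrix(likelihoods, severities, figsize=(9, 5), save=True,
                     results_path="results/", max_chars=20)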

mika.kd.trend_analysis.proposed_topics(lists=[])

Proposed topics

Experimental function to identify topics that may be relevant to specified hazards based on manually labeled data.

Parameters:

lists (list, optional) – list of lists. inner lists are topic numbers for each document manually labeled as associated with a hazard. The default is [].

Returns:

proposed_topics – list of proposed new topics.

Return type:

list

mika.kd.trend_analysis.record_hazard_doc_info(hazard_name, year, docs_per_hazard, id_, frequency, hazard_words_per_doc, docs, h_word)

Record hazard document information

Saves the information for a specified document that contains a specified hazard.

Parameters:
  • hazard_name (string) – name of specified hazard

  • year (int or str) – year that the report occurs in

  • docs_per_hazard (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.

  • id_ (string) – id of the specified document

  • frequency (Dict) – Nested dictionary used to store hazard frequencies. Keys are hazards and inner dict keys are years, values are ints.

  • hazard_words_per_doc (Dict) – used to store the hazard words per document. keys are hazards and values are lists with an element for each document.

  • docs (list) – list of document ids.

  • h_word (string) – hazard word

Returns:

  • docs_per_hazard (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.

  • frequency (Dict) – Nested dictionary used to store hazard frequencies. Keys are hazards and inner dict keys are years, values are ints.

  • hazard_words_per_doc (Dict) – used to store the hazard words per document. keys are hazards and values are lists with an element for each document.

mika.kd.trend_analysis.regression_feature_importance(predictors, hazards, correlation_mat_total)

Regression feature importance

Performs feature importance for a set of linear regression analyses.

Parameters:
  • predictors (list) – list of predictor names, used to identify inputs to single linear regression

  • hazards (list) – list of hazard names, used to identify targets for single linear regression

  • correlation_mat_total (dataframe) – stores the time series values that were used for correlation matrix. rows are years, columns are predictors + hazard frequencies

Returns:

results_df – dataframe containing both the mse and r2 scores for the model

Return type:

pandas dataframe

mika.kd.trend_analysis.remove_outliers(data, threshold=1.5, rm_outliers=True)

Remove outliers

Removes outliers from the dataset using the interquartile range.

Parameters:
  • data (list) – list of data points

  • threshold (float, optional) – Threshold for the distance outside of the interquartile range that defines an outlier. The default is 1.5.

  • rm_outliers (Boolean, optional) – True to remove outliers, False to return the original data. The default is True.

Returns:

new_data – list of the data points with outliers removed

Return type:

list
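
The interquartile range rule keeps points inside [Q1 - t*IQR, Q3 + t*IQR], where t is the threshold. A minimal sketch of that rule (the function's exact quantile computation may differ):

    import numpy as np

    data = [1, 2, 2, 3, 3, 4, 50]
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    t = 1.5  # threshold
    new_data = [x for x in data if q1 - t * iqr <= x <= q3 + t * iqr]  # drops 50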

mika.kd.trend_analysis.reshape_correlation_matrix(corrMatrix, p_values, predictors, hazards, figsize=(8, 8.025), fontsize=16)

Reshape correlation matrix

Reshapes the correlation matrix between all predictors and all hazard frequencies. Columns are predictors and rows are hazards. Arguments are outputs from create_correlation_matrix.

Parameters:
  • corrMatrix (pandas DataFrame) – correlation matrix.

  • p_values (pandas DataFrame) – p-vals for each correlation.

  • predictors (list) – list of predictors

  • hazards (list) – list of hazards

  • figsize (tuple, optional) – size of the plot in inches. The default is (8, 8.025).

  • fontsize (int, optional) – fontsize for the plot. The default is 16.

Return type:

None.

mika.kd.trend_analysis.sample_for_accuracy(preprocessed_df, id_col, text_col, hazards, save_path, num_sample=100)

Sample for accuracy

Generates a spreadsheet of randomly sampled documents to analyze the quality of hazard extraction.

Parameters:
  • preprocessed_df (pandas DataFrame) – pandas dataframe containing documents

  • id_col (string) – column in preprocessed_df containing document ids

  • text_col (string) – column in preprocessed_df containing text

  • hazards (list) – list of hazards

  • save_path (string) – location to save the file to

  • num_sample (int, optional) – number of documents to sample. The default is 100.

Returns:

sampled_df – dataframe of sampled documents

Return type:

pandas DataFrame
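
A usage sketch (the column names and save path are illustrative):

    from mika.kd.trend_analysis import sample_for_accuracy

    # preprocessed_df: pandas dataframe containing the documents
    sampled_df = sample_for_accuracy(
        preprocessed_df,
        id_col="Report ID",
        text_col="Narrative",
        hazards=["Engine Failure", "Bird Strike"],
        save_path="accuracy_sample.xlsx",
        num_sample=100,
    )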

mika.kd.trend_analysis.set_up_docs_per_hazard_vars(preprocessed_df, id_field, hazards, time_field)

Set up documents per hazard variables

Instantiates variables used to find the documents per hazard.

Parameters:
  • preprocessed_df (pandas DataFrame) – pandas dataframe containing documents

  • id_field (string) – the column in preprocessed df that contains document ids

  • hazards (list) – list of hazards

  • time_field (string) – the column in preprocessed df that contains document time values, such as report year

Returns:

  • docs (list) – list of document ids.

  • frequency (Dict) – Nested dictionary used to store hazard frequencies. Keys are hazards and inner dict keys are years, values are ints.

  • docs_per_hazard (Dict) – nested dictionary used to store documents per hazard. Keys are hazards and value is an inner dict. Inner dict has keys as time variables (e.g., years) and values are lists.

  • hazard_words_per_doc (Dict) – used to store the hazard words per document. keys are hazards and values are lists with an element for each document.