utils package

mika.utils.Data

class mika.utils.Data(name='')

Utility for loading and preprocessing datasets to be used in MIKA analyses. Data should be loaded and preprocessed with this class as needed before using methods from the KD and IR classes.

Variables:

name (string, optional) – name of the dataset (default is “”)

load(filename, preprocessed=False, id_col=None, text_columns=[], name='', load_kwargs={}, preprocessed_kwargs={})

Loads in data, either preprocessed or raw.

Parameters:
  • filename (string) – filename where the data is stored.

  • preprocessed (Boolean, optional) – true if the data is preprocessed. The default is False.

  • id_col (string, optional) – the column in the dataset where the document ids are stored. The default is None.

  • text_columns (list, optional) – list of columns in the dataset that contain text. The default is [].

  • name (string, optional) – name of the dataset. The default is ‘’.

  • load_kwargs (dict, optional) – dictionary of kwargs for loading raw data. The default is {}.

  • preprocessed_kwargs (dict, optional) – dictionary of kwargs for loading preprocessed data. The default is {}.

Return type:

None.
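A minimal usage sketch: instantiate Data and load a raw dataset with the signature above. The file name and column names below are hypothetical placeholders, not part of MIKA.

    # Import per the documented class path mika.utils.Data.
    from mika.utils import Data

    # Instantiate the utility with an optional dataset name.
    data = Data(name="example_dataset")

    # Load a raw (not yet preprocessed) dataset. The file path and
    # column names are placeholders for illustration only.
    data.load(
        "reports.csv",
        preprocessed=False,
        id_col="Report Number",
        text_columns=["Narrative", "Synopsis"],
    )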

prepare_data(combine_columns=[], remove_incomplete_rows=True, create_ids=False)

Prepares data by creating unique ids, removing incomplete rows, and combining columns, as defined by the user.

Parameters:
  • combine_columns (list, optional) – list of columns to combine. The default is [].

  • remove_incomplete_rows (boolean, optional) – true to remove incomplete rows. The default is True.

  • create_ids (boolean, optional) – true to create unique ids. The default is False.

Return type:

None.
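Continuing the sketch above, prepare_data can combine the loaded text columns and generate ids; the column names remain hypothetical.

    # Combine two hypothetical text columns, drop incomplete rows, and
    # create unique document ids.
    data.prepare_data(
        combine_columns=["Narrative", "Synopsis"],
        remove_incomplete_rows=True,
        create_ids=True,
    )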

preprocess_data(domain_stopwords=[], ngrams=True, ngram_range=3, threshold=15, min_count=5, quot_correction=False, spellcheck=False, segmentation=False, drop_short_docs_thres=3, percent=0.3, drop_na=False, save_words=[], drop_dups=False, min_word_len=2, max_word_len=15)

Preprocess data

Performs the data preprocessing steps specified by the user.

Parameters:
  • domain_stopwords (list, optional) – list of domain specific stopwords. The default is [].

  • ngrams (boolean, optional) – true to generate ngrams. The default is True.

  • ngram_range (int, optional) – highest n-gram order, i.e. the greatest number of words that can be joined into a single n-gram. The default is 3.

  • threshold (int, optional) – threshold used by gensim Phrases when forming n-grams. The default is 15.

  • min_count (int, optional) – minimum number of times an n-gram must occur in the corpus to be considered an n-gram. The default is 5.

  • quot_correction (boolean, optional) – true to perform ‘quot’ normalization. The default is False.

  • spellcheck (boolean, optional) – true to use the SymSpell spellchecker. The default is False.

  • segmentation (boolean, optional) – true to use word segmentation spell checker. The default is False.

  • drop_short_docs_thres (int, optional) – threshold document length used to drop short documents. The default is 3.

  • percent (float, optional) – removes words that appear in greater than or equal to this proportion of documents. The default is 0.3.

  • drop_na (boolean, optional) – true to drop rows with NaN values. The default is False.

  • save_words (list, optional) – list of words to save from frequent word removal. The default is [].

  • drop_dups (boolean, optional) – true to drop duplicate rows. The default is False.

  • min_word_len (int, optional) – minimum word length. The default is 2.

  • max_word_len (int, optional) – maximum word length. The default is 15.

Returns:

correction_list – list of misspelled words and their corrections, if spelling correction is used.

Return type:

list
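A hedged sketch of a preprocessing call using a subset of the parameters above; the domain stopwords and saved words are placeholders chosen for illustration.

    # Build n-grams up to trigrams, run the spellchecker, and drop
    # documents shorter than the threshold.
    correction_list = data.preprocess_data(
        domain_stopwords=["aircraft", "runway"],
        ngrams=True,
        ngram_range=3,
        min_count=5,
        spellcheck=True,
        drop_short_docs_thres=3,
        save_words=["fatigue"],
    )
    # correction_list contains misspelled words and their corrections
    # because spellcheck=True.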

save(results_path='')

Saves preprocessed data.

Parameters:

results_path (string, optional) – path to save the data to. The default is “”.

Return type:

None.
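For example, the preprocessed data can then be written out; the results path below is a placeholder.

    # Save the preprocessed dataset to a hypothetical results directory.
    data.save(results_path="results/preprocessed")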

sentence_tokenization()

Tokenizes each document in the dataset into sentences. Creates an updated data_df where each sentence has a separate row.

Return type:

None.
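For example, the following call splits each loaded document into sentence-level rows:

    # Each sentence in every document becomes its own row in data_df.
    data.sentence_tokenization()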