CWordTM Package

Data Folders

Submodules

cwordtm.meta module

cwordtm.meta.addin(func, *, timing=False, code=0)[source]

Adds two optional features, timing information and source-code display, to the function ‘func’ at runtime by giving it two extra keyword parameters, ‘timing’ and ‘code’. ‘timing’ is a flag indicating whether the execution time of the function is shown; it defaults to False. ‘code’ determines whether the source code of ‘func’ is shown and whether the function is invoked: 0 executes the function without showing its source code, 1 shows the source code after execution, and 2 shows the source code without executing the function; it defaults to 0.

Parameters:: func (function) – The target function into which the additional features (timing information and source-code display) are inserted, default to None
Returns:: The wrapper function
Return type:: function
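The behaviour described above can be pictured as an ordinary decorator. The sketch below is not the package's implementation; `addin_sketch` and `add` are hypothetical names, the printed message format is invented, and only the documented meaning of ‘timing’ and ‘code’ is taken from this reference.

```python
import functools
import inspect
import time


def addin_sketch(func):
    """Hypothetical stand-in for the wrapper described above (not cwordtm code)."""

    @functools.wraps(func)
    def wrapper(*args, timing=False, code=0, **kwargs):
        result = None
        if code != 2:  # code 0 or 1: the function is executed
            start = time.perf_counter()
            result = func(*args, **kwargs)
            if timing:
                elapsed = time.perf_counter() - start
                print(f"Finished {func.__name__!r} in {elapsed:.4f} secs")
        if code in (1, 2):  # code 1 or 2: the source code is shown
            try:
                print(inspect.getsource(func))
            except (OSError, TypeError):
                # Source is unavailable, e.g. for code typed into a bare REPL
                print(f"(source of {func.__name__!r} is unavailable)")
        return result

    return wrapper


@addin_sketch
def add(a, b):
    return a + b


total = add(2, 3, timing=True)  # executes and reports the elapsed time
skipped = add(2, 3, code=2)     # shows the source only; nothing is executed
```

Because the two extra parameters are keyword-only, they cannot collide with the positional arguments of the wrapped function.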

cwordtm.meta.addin_all(modname='cwordtm', *, timing=False, code=0)[source]

Applies ‘addin’ function to all functions of all sub-modules of a module at runtime.

Parameters:: modname (str, optional) – The target module whose functions all receive the additional features, default to ‘cwordtm’

cwordtm.meta.addin_all_functions(submod, *, timing=False, code=0)[source]

Applies ‘addin’ function to all functions of a module at runtime.

Parameters:: submod (module) – The target sub-module whose functions all receive the additional features, default to None
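One common way to apply a wrapper to every function of a module is to walk its members and rebind each name. The following is a minimal sketch of that general technique under stated assumptions: `wrap_all_functions`, `count_calls` and the throwaway `demo` module are hypothetical, and the package may do this differently.

```python
import functools
import inspect
import types


def count_calls(func):
    """Toy decorator used only for this illustration."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


def wrap_all_functions(module, decorator):
    """Rebind every function defined in `module` to its decorated version."""
    wrapped = []
    for name, obj in inspect.getmembers(module, inspect.isfunction):
        # Skip functions that were merely imported into the module
        if obj.__module__ == module.__name__:
            setattr(module, name, decorator(obj))
            wrapped.append(name)
    return wrapped


# Build a small module in memory so that the sketch is self-contained
demo = types.ModuleType("demo")
exec("def double(x):\n    return 2 * x\n\ndef _helper():\n    return 'ok'",
     demo.__dict__)

wrapped = wrap_all_functions(demo, count_calls)
result = demo.double(4)
```

Rebinding through `setattr` only affects later lookups on the module object; names imported earlier with `from module import name` keep pointing at the undecorated function.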

cwordtm.meta.get_function(mod_name, submodules, func_name, *, timing=False, code=0)[source]

Gets the object of the function ‘func_name’ if it belongs to one of ‘submodules’ of the current top-level module.

Parameters:

mod_name (str) – The name of the source top-level module, default to None
submodules (list) – The list of names of the sub-modules of the top-level module
func_name (str) – The name of the function to be looked for

Returns:

The object of the target function, if any, otherwise None

Return type:

function
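A lookup of this kind can be sketched with `importlib`. The helper below is an assumption-laden illustration, not the package's code: `find_function` is a hypothetical name, and the standard-library `email` package stands in for the top-level module so that the example runs anywhere.

```python
import importlib
import inspect


def find_function(mod_name, submodules, func_name):
    """Return the function `func_name` from the first sub-module that defines it."""
    for sub in submodules:
        try:
            module = importlib.import_module(f"{mod_name}.{sub}")
        except ImportError:
            continue  # unknown sub-module: try the next one
        candidate = getattr(module, func_name, None)
        if inspect.isfunction(candidate):
            return candidate
    return None


found = find_function("email", ["charset", "utils"], "formatdate")
missing = find_function("email", ["charset", "utils"], "no_such_function")
```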

cwordtm.meta.get_module_info(detailed=False, *, timing=False, code=0)[source]

Gets the information of the module ‘cwordtm’.

Parameters:: detailed (bool, optional) – The flag indicating whether only function signature or detailed source code is shown, default to False
Returns:: The information of the module ‘cwordtm’
Return type:: str

cwordtm.meta.get_submodule_info(submodname, detailed=False, *, timing=False, code=0)[source]

Gets the information of the prescribed submodule of the module ‘cwordtm’.

Parameters:

submodname (str) – The name of the prescribed submodule, default to None
detailed (bool, optional) – The flag indicating whether only function signature or detailed source code is shown, default to False

Returns:

The information of the prescribed submodule

Return type:

str

cwordtm.pivot module

cwordtm.pivot.pivot(df, value='text', category='category', *, timing=False, code=0)[source]

Returns a pivot table from the DataFrame ‘df’ storing the input documents, grouped by the prescribed column.

Parameters:

df (pandas.DataFrame) – The DataFrame storing the input documents, default to None
value (str, optional) – The column to be grouped, default to ‘text’
category (str, optional) – The column to be the group-by column, default to ‘category’

Returns:

The pivot table of the input documents grouped by the prescribed column

Return type:

pandas.DataFrame

cwordtm.pivot.stat(df, chi=False, *, timing=False, code=0)[source]

Returns a pivot table from the DataFrame ‘df’ storing the input Scripture, with columns ‘book’, ‘book_no’, ‘chapter’, ‘verse’, ‘text’, ‘testament’, ‘category’, ‘cat’, and ‘cat_no’.

Parameters:

df (pandas.DataFrame) – The DataFrame storing the input Scripture, default to None
chi (bool, optional) – If True, the input text is assumed to be in Chinese; otherwise, it is assumed to be in English, default to False

Returns:

The pivot table of the input Scripture grouped by category (‘cat_no’)

Return type:

pandas.DataFrame

cwordtm.quot module

cwordtm.quot.extract_quotation(text, quot_marks, *, timing=False, code=0)[source]

Returns the text within a pair of quotation marks.

Parameters:

text (str) – The target text to be extracted, default to None
quot_marks (list) – A pair of quotation marks, [‘“’, ‘”’] for English text or [‘『’, ‘』’] for Chinese text, default to None

Returns:

The text within a pair of quotation marks, if any, otherwise, an empty string

Return type:

str
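The documented contract (the text inside the first pair of marks, or an empty string) can be sketched with plain string searches. `extract_between` is a hypothetical name, and the handling of a missing closing mark is an assumption.

```python
def extract_between(text, quot_marks):
    """Return the text between the first opening mark and the next closing mark."""
    opening, closing = quot_marks
    start = text.find(opening)
    if start == -1:
        return ""  # no opening mark
    start += len(opening)
    end = text.find(closing, start)
    if end == -1:
        return ""  # assumption: an unclosed quotation yields an empty string
    return text[start:end]


english = extract_between("He said, “Follow me.”", ["“", "”"])
chinese = extract_between("耶穌說:『來跟從我。』", ["『", "』"])
nothing = extract_between("No quotation here.", ["“", "”"])
```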

cwordtm.quot.match_text(target, sent_tokens, lang, threshold, n=5, *, timing=False, code=0)[source]

Returns a list of tuples of the cosine similarity measures of the OT verse with target verse and the index of that OT verse in the DataFrame storing the prescribed OT Scripture.

Parameters:

target (str) – The target verse to be matched, default to None
sent_tokens (str) – The target verse to be matched, default to None
lang (str) – If the value is ‘chi’, the processed language is assumed to be Chinese; otherwise, it is English, default to None
threshold (float) – The threshold value of the cosine similarity measure between the target verse and an OT verse; the cosine similarity of a matched OT verse with the target verse must be greater than this value, default to None
n (int, optional) – The upper bound of the number of matched verses, default to 5

Returns:

The list of tuples of the cosine similarity measure and the index of the OT verse

Return type:

list
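The selection rule described above (keep verses whose cosine similarity exceeds the threshold, at most n of them) can be illustrated with a bag-of-words sketch. How the package vectorizes verses is not stated here, so the raw word counts below are an assumption, and `cosine` and `match_sketch` are hypothetical names.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def match_sketch(target, candidates, threshold, n=5):
    """Return up to n (similarity, index) tuples scoring above the threshold."""
    target_vec = Counter(target.lower().split())
    scored = [(cosine(target_vec, Counter(c.lower().split())), i)
              for i, c in enumerate(candidates)]
    hits = [pair for pair in scored if pair[0] > threshold]
    return sorted(hits, reverse=True)[:n]  # best matches first


candidates = [
    "the lord is my shepherd",
    "in the beginning god created",
    "my shepherd is the lord",
]
hits = match_sketch("the lord is my shepherd", candidates, threshold=0.5)
top_one = match_sketch("the lord is my shepherd", candidates, threshold=0.5, n=1)
```

Word order is ignored by a bag-of-words vector, which is why the first and third candidates both match here.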

cwordtm.quot.match_verse(i, ot_list, otdf, df, book, chap, verse, lang, threshold, *, timing=False, code=0)[source]

Returns whether the target NT verse (book, chap, verse) can match a particular verse in the list of OT verses (ot_list), and prints the matched OT verse(s).

Parameters:

i (int) – The number of matched verses so far, default to None
ot_list (list) – The list of OT verses (str) to be matched, default to None
otdf (pandas.DataFrame) – The DataFrame storing the prescribed OT verses to be matched, default to None
df (pandas.DataFrame) – The DataFrame storing the collection of the target NT verses to be matched, default to None
book (str) – The Bible book short name (3 characters) of the target NT verse to be matched, default to None
chap (int) – The chapter number of the target NT verse to be matched, default to None
verse (int) – The verse number of the target NT verse to be matched, default to None
lang (str) – If the value is ‘chi’, the processed language is assumed to be Chinese; otherwise, it is English, default to None
threshold (float) – The threshold value of the cosine similarity measure between the target verse and an OT verse; the measure for a successful match must be greater than this value, default to None

Returns:

True if the target verse matched an OT verse, False otherwise

Return type:

bool

cwordtm.quot.show_quot(target, source='ot', lang='en', threshold=0.5, *, timing=False, code=0)[source]

Shows a collection of matched OT verses, if any, based on the prescribed collection of target NT verses and the threshold value.

Parameters:

target (pandas.DataFrame) – The collection of target NT verses to be matched, default to None
source (str, optional) – The string representing the collection of all or subset of OT verses to be matched, default to ‘ot’
lang (str, optional) – If the value is ‘en’, the processed language is assumed to be English; otherwise, it is Chinese, default to ‘en’
threshold (float, optional) – The threshold value of the cosine similarity measure between the target verse and an OT verse; the measure for a successful match must be greater than this value, default to 0.5

Returns:

The list of tuples of the cosine similarity measure and the index of the OT verse

Return type:

list

cwordtm.quot.tokenize(sentence, *, timing=False, code=0)[source]

Returns a list of tokens from a Chinese sentence.

Parameters:: sentence (str) – The target text to be tokenized, default to None
Returns:: The generator object that yields the tokens extracted from the sentence
Return type:: generator

cwordtm.ta module

cwordtm.ta.get_sent_scores(sentences, diction, sent_len, *, timing=False, code=0) → dict[source]

Returns the dictionary of a list of sentences with their scores computed by their words.

Parameters:

sentences (list) – The list of sentences for computing their scores, default to None
diction (collections.Counter object) – The dictionary storing the collection of tokenized words with their frequencies
sent_len (int) – The maximum number of words in a sentence to be processed, default to None

Returns:

The dictionary of sentences with their scores computed from their words

Return type:

dict

cwordtm.ta.get_sentences(docs, lang='en', *, timing=False, code=0)[source]

Returns the list of sentences tokenized from the collection of documents (‘docs’).

Parameters:

docs (pandas.DataFrame) – The input documents storing the Scripture, default to None
lang (str, optional) – If the value is ‘chi’, the processed language is assumed to be Chinese; otherwise, it is English, default to ‘en’

Returns:

The list of sentences tokenized from the collection of documents

Return type:

list

cwordtm.ta.get_summary(sentences, sent_weight, threshold, sent_len, *, timing=False, code=0)[source]

Returns the summary of the collection of sentences.

Parameters:

sentences (list) – The list of target sentences for summarization, default to None
sent_weight (collections.Counter object) – The dictionary of a list of sentences with their scores computed by their words
threshold (float) – The minimum value of sentence weight for extracting that sentence as part of the final summary, default to None
sent_len (int) – The maximum number of words in a sentence to be processed, default to None

Returns:

The list of sentences of the extractive summary

Return type:

list
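The scoring and thresholding steps documented for get_sent_scores and get_summary follow the usual frequency-based extractive pattern, which can be sketched as below. This is an illustration under assumptions, not the package's code: the function names are hypothetical, a sentence's score is taken to be the mean frequency of its words, sentences longer than `sent_len` words are assumed to be skipped, and the mean score is used as the threshold.

```python
import re
from collections import Counter


def words_of(sentence):
    """Very small tokenizer used only for this illustration."""
    return re.findall(r"[a-z']+", sentence.lower())


def sentence_scores(sentences, diction, sent_len):
    """Score each sentence by the mean corpus frequency of its words."""
    scores = {}
    for sent in sentences:
        words = words_of(sent)
        if 0 < len(words) <= sent_len:  # assumption: longer sentences are skipped
            scores[sent] = sum(diction[w] for w in words) / len(words)
    return scores


def pick_summary(sentences, sent_weight, threshold):
    """Keep, in original order, the sentences whose weight reaches the threshold."""
    return [s for s in sentences if sent_weight.get(s, 0) >= threshold]


sentences = [
    "God is love.",
    "Love is patient and love is kind.",
    "It rained today.",
]
diction = Counter(w for s in sentences for w in words_of(s))
scores = sentence_scores(sentences, diction, sent_len=8)
threshold = sum(scores.values()) / len(scores)  # assumption: mean score
summary = pick_summary(sentences, scores, threshold)
```

Multiplying the threshold by a factor greater than 1, as the ‘weight’ parameter of summary_chi is described as doing, would make the summary shorter.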

cwordtm.ta.preprocess_sent(text, *, timing=False, code=0)[source]

Preprocesses English text by tokenizing text into sentences of words, converting text to lower case, removing stopwords, lemmatizing text, and tagging text with Part-of-Speech (POS).

Parameters:: text (str) – The text to be preprocessed, default to None
Returns:: The list of preprocessed and tagged sentences (word, pos)
Return type:: list of tuples (str, str)

cwordtm.ta.split_chi_sentences(text, *, timing=False, code=0)[source]

cwordtm.ta.summary_chi(docs, weight=1.5, sent_len=8, *, timing=False, code=0)[source]

Returns an extractive summary of a collection of Chinese sentences.

Parameters:

docs (pandas.DataFrame or pandas.Series or numpy.ndarray or list) – The collection of target documents for summarization, default to None
weight (float, optional) – The factor to be multiplied to the threshold, which determines the sentences as the summary, default to 1.5
sent_len (int, optional) – The maximum number of words in a sentence to be processed, default to 8

Returns:

The list of sentences of the extractive summary

Return type:

list

cwordtm.ta.summary_en(docs, sent_len=8, *, timing=False, code=0)[source]

Returns an extractive summary of a collection of English sentences.

Parameters:

docs (pandas.DataFrame or pandas.Series or numpy.ndarray or list or text) – The collection of target documents for summarization, default to None
sent_len (int, optional) – The maximum number of words in a sentence to be processed, default to 8

Returns:

The list of sentences of the extractive summary

Return type:

list

cwordtm.tm module

class cwordtm.tm.BTM(doc_file, num_topics, chi=False, embed=True)[source]

Bases: object

The BTM object for BERTopic modeling.

Variables:

num_topics (int) – The number of topics to be modeled, default to 10
doc_file (str) – The filename of the text file to be processed
chi (bool) – The flag indicating whether the processed text is in Chinese or not, True stands for Traditional Chinese or False for English
num_topics – The number of topics set for the topic model
docs (pandas.DataFrame or list) – The collection of the original documents to be processed
pro_docs (list) – The collection of documents, in form of list of lists of words after text preprocessing
dictionary (gensim.corpora.Dictionary) – The dictionary of word ids with their tokenized words from preprocessed documents (‘pro_docs’)
corpus (list) – The list of documents, where each document is a list of tuples (word id, word frequency in the particular document)
model (bertopic.BERTopic) – The BERTopic model object
embed (bool) – The flag indicating whether the BERTopic model is trained with the BERT pretrained model
bmodel (transformers.BertModel) – The BERT pretrained model
bt_vectorizer (sklearn.feature_extraction.text.CountVectorizer) – The vectorizer extracted from the BERTopic model for model evaluation
bt_analyzer (functools.partial) – The analyzer extracted from the BERTopic model for model evaluation
cleaned_docs (list) – The list of documents (string) built by grouping the original documents by the topics created from the BERTopic model
too_few (bool) – The flag indicating whether there are too few documents to fit the BERTopic model
figures (list(tuple(matplotlib.pyplot.figure))) – The list of tuples (figure type, figure) of model visualization figures

__init__(doc_file, num_topics, chi=False, embed=True)[source]: Constructor method.

evaluate()[source]: Computes and outputs the coherence score.

fit()[source]: Build the BERTopic model for English text with the created corpus and dictionary.

fit_chi()[source]: Build the BERTopic model for Chinese text with the created corpus and dictionary.

load(file)[source]

Loads the stored BERTopic model from the specified file.

Parameters:: file (str) – The name of the file to be loaded, default to None
Returns:: The loaded BERTopic model
Return type:: bertopic._bertopic.BERTopic

pre_evaluate()[source]: Prepare the original documents per built topic for model evaluation.

preprocess()[source]: Process the original English documents (cwordtm.tm.BTM.docs) by invoking cwordtm.tm.process_text, and build a dictionary and a corpus from the preprocessed documents for the BERTopic model.

preprocess_chi()[source]: Process the original Chinese documents (cwordtm.tm.BTM.docs) by tokenizing text, removing stopwords, and building a dictionary and a corpus from the preprocessed documents for the BERTopic model.

save(file)[source]

Saves the built BERTopic model to the specified file.

Parameters:: file (str) – The name of the file to store the built model, default to None

show_topics()[source]: Shows the topics with their keywords from the built BERTopic model.

viz(web_app=False)[source]

Visualize the built BERTopic model through Intertopic Distance Map, Topic Word Score Charts, and Topic Similarity Matrix.

Parameters:: web_app (bool) – The flag indicating the function is initiated from a web application, default to False

class cwordtm.tm.LDA(doc_file, num_topics, chi=False)[source]

Bases: object

The LDA object for Latent Dirichlet Allocation (LDA) modeling.

Variables:

num_topics (int) – The number of topics to be modeled, default to 10
doc_file (str) – The filename of the text file to be processed
chi (bool) – The flag indicating whether the processed text is in Chinese or not, True stands for Traditional Chinese or False for English
num_topics – The number of topics set for the topic model
docs (pandas.DataFrame or list) – The collection of the original documents to be processed
pro_docs (list) – The collection of documents, in form of list of lists of words after text preprocessing
dictionary (gensim.corpora.Dictionary) – The dictionary of word ids with their tokenized words from preprocessed documents (‘pro_docs’)
corpus (list) – The list of documents, where each document is a list of tuples (word id, word frequency in the particular document)
model (gensim.models.LdaModel) – The LDA model object
vis_data (pyLDAvis.PreparedData) – The LDA model’s prepared data for visualization

__init__(doc_file, num_topics, chi=False)[source]: Constructor method.

evaluate()[source]: Computes and outputs the coherence score, perplexity, topic diversity, and topic size distribution.

fit()[source]: Build the LDA model with the created corpus and dictionary.

load(file)[source]

Loads the stored LDA model from the specified file.

Parameters:: file (str) – The name of the file to be loaded, default to None
Returns:: The loaded LDA model
Return type:: gensim.models.LdaModel

preprocess()[source]: Process the original English documents (cwordtm.tm.LDA.docs) by invoking cwordtm.tm.process_text, and build a dictionary and a corpus from the preprocessed documents for the LDA model.

preprocess_chi()[source]: Process the original Chinese documents (cwordtm.tm.LDA.docs) by tokenizing text, removing stopwords, and building a dictionary and a corpus from the preprocessed documents for the LDA model.

save(file)[source]

Saves the built LDA model to the specified file.

Parameters:: file (str) – The name of the file to store the built model, default to None

show_topics()[source]: Shows the topics with their keywords from the built LDA model.

viz(web_app=False)[source]

Shows the Intertopic Distance Map for the built LDA model.

Parameters:: web_app (bool) – The flag indicating the function is initiated from a web application, default to False

class cwordtm.tm.NMF(doc_file, num_topics, chi=False)[source]

Bases: object

The NMF object for Non-negative Matrix Factorization (NMF) modeling.

Variables:

num_topics (int) – The number of topics to be modeled, default to 10
doc_file (str) – The filename of the text file to be processed
chi (bool) – The flag indicating whether the processed text is in Chinese or not, True stands for Traditional Chinese or False for English
num_topics – The number of topics set for the topic model
docs (pandas.DataFrame or list) – The collection of the original documents to be processed
pro_docs (list) – The collection of documents, in form of list of lists of words after text preprocessing
dictionary (gensim.corpora.Dictionary) – The dictionary of word ids with their tokenized words from preprocessed documents (‘pro_docs’)
corpus (list) – The list of documents, where each document is a list of tuples (word id, word frequency in the particular document)
model (gensim.models.Nmf) – The NMF model object
figures (list(matplotlib.pyplot.figure)) – The list of model visualization figures

__init__(doc_file, num_topics, chi=False)[source]: Constructor method.

evaluate()[source]: Computes and outputs the coherence score, topic diversity, and topic size distribution.

fit()[source]: Build the NMF model with the created corpus and dictionary.

load(file)[source]

Loads the stored NMF model from the specified file.

Parameters:: file (str) – The name of the file to be loaded, default to None
Returns:: The loaded NMF model and the loaded dictionary of the NMF’s corpus
Return type:: gensim.models.Nmf, gensim.corpora.Dictionary

preprocess()[source]: Process the original English documents (cwordtm.tm.NMF.docs) by invoking cwordtm.tm.process_text, and build a dictionary and a corpus from the preprocessed documents for the NMF model.

preprocess_chi()[source]: Process the original Chinese documents (cwordtm.tm.NMF.docs) by tokenizing text, removing stopwords, and building a dictionary and a corpus from the preprocessed documents for the NMF model.

save(file)[source]

Saves the built NMF model to the specified file.

Parameters:: file (str) – The name of the file to store the built model, default to None

show_topics_words()[source]: Shows the topics with their keywords from the built NMF model.

viz(web_app=False)[source]

Plot the topic distributions as a stacked bar chart for the built NMF model.

Parameters:: web_app (bool) – The flag indicating the function is initiated from a web application, default to False

cwordtm.tm.btm_process(doc_file, num_topics=10, source=0, text_col='text', doc_size=0, cat=0, chi=False, group=True, eval=False, web_app=False, *, timing=False, code=0)[source]

Pipelines the BERTopic modeling.

Parameters:

doc_file (str or io.BytesIO) – The filename of the prescribed text file to be loaded, or a BytesIO object from Streamlit’s file_uploader, default to None
num_topics (int, optional) – The number of topics to be modeled, default to 10
source (int, optional) – The source of the prescribed document file (‘doc_file’), where 0 refers to internal store of the package and 1 to external file, default to 0
text_col (str, optional) – The name of the text column to be extracted, default to ‘text’
doc_size (int, tuple, optional) – The number of documents to be processed, 0 represents all documents, or the range (tuple) of documents to be processed, default to 0
cat (int or str, optional) – The category indicating a subset of the Scripture to be loaded, where 0 stands for the whole Bible, 1 for OT, 2 for NT, or one of the ten categories [‘tor’, ‘oth’, ‘ket’, ‘map’, ‘mip’, ‘gos’, ‘nth’, ‘pau’, ‘epi’, ‘apo’] (See the package’s internal file ‘data/book_cat.csv’), default to 0
chi (bool, optional) – The flag indicating whether the text is processed as Chinese (True) or English (False), default to False
group (bool, optional) – The flag indicating whether the loaded text is grouped by chapter, default to True
eval (bool, optional) – The flag indicating whether the model evaluation results will be shown, default to False
web_app (bool) – The flag indicating the function is initiated from a web application, default to False

Returns:

The pipelined BTM

Return type:

cwordtm.tm.BTM object

cwordtm.tm.lda_process(doc_file, num_topics=10, source=0, text_col='text', doc_size=0, cat=0, chi=False, group=True, eval=False, web_app=False, *, timing=False, code=0)[source]

Pipelines the LDA modeling.

Parameters:

doc_file (str or io.BytesIO) – The filename of the prescribed text file to be loaded, or a BytesIO object from Streamlit’s file_uploader, default to None
num_topics (int, optional) – The number of topics to be modeled, default to 10
source (int, optional) – The source of the prescribed document file (‘doc_file’), where 0 refers to internal store of the package and 1 to external file, default to 0
text_col (str, optional) – The name of the text column to be extracted, default to ‘text’
doc_size (int, tuple, optional) – The number of documents to be processed, 0 represents all documents, or the range (tuple) of documents to be processed, default to 0
cat (int or str, optional) – The category indicating a subset of the Scripture to be loaded, where 0 stands for the whole Bible, 1 for OT, 2 for NT, or one of the ten categories [‘tor’, ‘oth’, ‘ket’, ‘map’, ‘mip’, ‘gos’, ‘nth’, ‘pau’, ‘epi’, ‘apo’] (See the package’s internal file ‘data/book_cat.csv’), default to 0
chi (bool, optional) – The flag indicating whether the text is processed as Chinese (True) or English (False), default to False
group (bool, optional) – The flag indicating whether the loaded text is grouped by chapter, default to True
eval (bool, optional) – The flag indicating whether the model evaluation results will be shown, default to False
web_app (bool) – The flag indicating the function is initiated from a web application, default to False

Returns:

The pipelined LDA

Return type:

cwordtm.tm.LDA object

cwordtm.tm.load_bible(textfile, cat=0, group=True, *, timing=False, code=0)[source]

Loads and returns the Bible Scripture from the prescribed internal file (‘textfile’).

Parameters:

textfile (str) – The package’s internal Bible text from which the text is loaded, either World English Bible (‘web.csv’) or Chinese Union Version (Traditional) (‘cuv.csv’), default to None
cat (int or str, optional) – The category indicating a subset of the Scripture to be loaded, where 0 stands for the whole Bible, 1 for OT, 2 for NT, or one of the ten categories [‘tor’, ‘oth’, ‘ket’, ‘map’, ‘mip’, ‘gos’, ‘nth’, ‘pau’, ‘epi’, ‘apo’] (See the package’s internal file ‘data/book_cat.csv’), default to 0
group (bool, optional) – The flag indicating whether the loaded text is grouped by chapter, default to True

Returns:

The collection of Scripture loaded

Return type:

pandas.DataFrame

cwordtm.tm.load_text(textfile, doc_size=0, text_col='text', *, timing=False, code=0)[source]

Loads and returns the list of documents from the prescribed file (‘textfile’).

Parameters:

textfile (str) – The prescribed text file from which the text is loaded, default to None
nr (int, optional) – The number of rows of text to be loaded; 0 represents all rows, default to 0
doc_size (int, tuple, optional) – The number of documents to be processed, 0 represents all documents, or the range (tuple) of documents to be processed, default to 0
text_col (str, optional) – The name of the text column to be extracted, default to ‘text’

Returns:

The list of documents loaded

Return type:

list
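The ‘doc_size’ convention (0 for all documents, a positive integer for a count, a tuple for a range) can be sketched with list slicing. `select_docs` is a hypothetical helper, and treating the tuple as a half-open (start, end) pair is an assumption that this reference does not confirm.

```python
def select_docs(docs, doc_size=0):
    """Pick a subset of `docs` according to the doc_size convention."""
    if isinstance(doc_size, tuple):
        start, end = doc_size  # assumption: half-open range [start, end)
        return docs[start:end]
    if doc_size > 0:
        return docs[:doc_size]  # the first doc_size documents
    return docs  # 0 means all documents


docs = ["a", "b", "c", "d"]
everything = select_docs(docs)
first_two = select_docs(docs, 2)
middle = select_docs(docs, (1, 3))
```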

cwordtm.tm.nmf_process(doc_file, num_topics=10, source=0, text_col='text', doc_size=0, cat=0, chi=False, group=True, eval=False, web_app=False, *, timing=False, code=0)[source]

Pipelines the NMF modeling.

Parameters:

doc_file (str or io.BytesIO) – The filename of the prescribed text file to be loaded, or a BytesIO object from Streamlit’s file_uploader, default to None
num_topics (int, optional) – The number of topics to be modeled, default to 10
source (int, optional) – The source of the prescribed document file (‘doc_file’), where 0 refers to internal store of the package and 1 to external file, default to 0
text_col (str, optional) – The name of the text column to be extracted, default to ‘text’
doc_size (int, tuple, optional) – The number of documents to be processed, 0 represents all documents, or the range (tuple) of documents to be processed, default to 0
cat (int or str, optional) – The category indicating a subset of the Scripture to be loaded, where 0 stands for the whole Bible, 1 for OT, 2 for NT, or one of the ten categories [‘tor’, ‘oth’, ‘ket’, ‘map’, ‘mip’, ‘gos’, ‘nth’, ‘pau’, ‘epi’, ‘apo’] (See the package’s internal file ‘data/book_cat.csv’), default to 0
chi (bool, optional) – The flag indicating whether the text is processed as Chinese (True) or English (False), default to False
group (bool, optional) – The flag indicating whether the loaded text is grouped by chapter, default to True
eval (bool, optional) – The flag indicating whether the model evaluation results will be shown, default to False
web_app (bool) – The flag indicating the function is initiated from a web application, default to False

Returns:

The pipelined NMF

Return type:

cwordtm.tm.NMF object

cwordtm.tm.process_text(doc, *, timing=False, code=0)[source]

Processes the English text through tokenization, converting to lower case, removing all digits, stemming, and removing punctuation and stopwords.

Parameters:: doc (str) – The prescribed text, in the form of a string, to be processed, default to None
Returns:: The list of the processed strings
Return type:: list

cwordtm.util module

cwordtm.util.add_chi_vocab(*, timing=False, code=0)[source]: Loads the Chinese Bible vocabulary from the internal file ‘bible_vocab.txt’ and adds it to the Jieba word list for future tokenization.

cwordtm.util.bible_cat_info(lang='en', *, timing=False, code=0)[source]

Prints a table of Bible book categories with their books.

Parameters:: lang (str, optional) – The language of the information to be shown, default to “en”
Returns:: The table of Bible book categories
Return type:: pandas.DataFrame

cwordtm.util.chi_sent_terms(text, *, timing=False, code=0)[source]

Returns the list of Chinese words tokenized from the input text.

Parameters:: text (str) – The input Chinese text to be tokenized, default to None
Returns:: The list of Chinese words
Return type:: list

cwordtm.util.chi_stops(*, timing=False, code=0)[source]

Loads the common Chinese (Traditional) vocabulary to Jieba for future tokenization, and the Chinese stopwords for future wordcloud plotting.

Returns:: The list of stopwords for wordcloud plotting
Return type:: list

cwordtm.util.clean_sentences(sentences, *, timing=False, code=0)[source]

Cleans the list of sentences by invoking the function preprocess_text.

Parameters:: sentences (list) – The list of sentences to be cleaned, default to None
Returns:: The list of cleaned sentences
Return type:: list

cwordtm.util.clean_text(df, text_col='text', *, timing=False, code=0)[source]

Cleans the text from the Scripture stored in the DataFrame ‘df’ by removing all digits, replacing newlines with spaces, removing English stopwords, converting all characters to lower case, and removing all characters except alphanumerics and whitespace.

Parameters:

df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None
text_col (str, optional) – The name of the text column to be extracted, default to ‘text’

Returns:

The cleaned text in a DataFrame

Return type:

pandas.DataFrame

cwordtm.util.extract(df, testament=-1, category='', book=0, chapter=0, verse=0, *, timing=False, code=0)[source]

Extracts a subset of the Scripture stored in a DataFrame by testament, category, or book/chapter/verse.

Parameters:

df (pandas.DataFrame) – The collection of the Bible Scripture with columns ‘book’, ‘book_no’, ‘chapter’, ‘verse’, ‘text’, ‘testament’, ‘category’, ‘cat’, and ‘cat_no’, default to None
testament (int, optional) – The prescribed testament to be extracted, -1 stands for no prescription, 0 for OT, or 1 for NT, default to -1
category (str, optional) – The prescribed category to be extracted, and it should be either a full category name or a short name with 3 lower-case letters from a list of 10 categories, default to ‘’
book (str, int, optional) – The prescribed Bible book to be extracted, and it should be either a 3-letter short book name or a book number from 1 to 66, default to 0
chapter (int or tuple, optional) – The prescribed chapter or a tuple indicating the range of chapters of a Bible book to be extracted, default to 0
verse (int or tuple, optional) – The prescribed verse or a tuple indicating the range of verses from a chapter of a Bible book to be extracted, default to 0

Returns:

The subset of the input Scripture, if any, otherwise, the message ‘No scripture is extracted!’

Return type:

pandas.DataFrame or str
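The filtering rules for `testament`, `book`, `chapter`, and `verse` can be illustrated on plain records. This is a sketch under stated assumptions, not `cwordtm.util.extract` itself: `matches` and `extract_rows` are hypothetical names, the sample rows and their stored values are invented for illustration, and tuple ranges are assumed to be inclusive.

```python
def matches(value, spec):
    """Check value against an int (0 = no restriction) or an inclusive (low, high) tuple."""
    if isinstance(spec, tuple):
        low, high = spec
        return low <= value <= high
    return spec == 0 or value == spec


def extract_rows(rows, testament=-1, book=0, chapter=0, verse=0):
    """Filter verse records; a hypothetical stand-in for illustration."""
    result = []
    for row in rows:
        if testament != -1 and row["testament"] != testament:
            continue
        # book may be a short name or a book number
        if book != 0 and book not in (row["book"], row["book_no"]):
            continue
        if matches(row["chapter"], chapter) and matches(row["verse"], verse):
            result.append(row)
    return result


# Invented sample records (placeholder values only)
rows = [
    {"book": "gen", "book_no": 1, "testament": 0, "chapter": 1, "verse": 1},
    {"book": "gen", "book_no": 1, "testament": 0, "chapter": 1, "verse": 2},
    {"book": "gen", "book_no": 1, "testament": 0, "chapter": 2, "verse": 1},
    {"book": "mat", "book_no": 40, "testament": 1, "chapter": 1, "verse": 1},
]
nt_rows = extract_rows(rows, testament=1)
gen_ch1 = extract_rows(rows, book="gen", chapter=1)
first_verses = extract_rows(rows, book=1, chapter=(1, 2), verse=1)
```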

cwordtm.util.extract2(df, filter='', *, timing=False, code=0)[source]

Extracts a subset of the Scripture through a specific filter string by invoking the function ‘util.extract’.

Parameters:

df (pandas.DataFrame) – The collection of the Bible Scripture, default to None
filter (str, optional) – The prescribed filter string with the format ‘<book> <chapter>:<verse>[-<verse2>]’ for extracting a range of verses in the Scripture, default to ‘’

Returns:

The prescribed range of verses from the input Scripture, or the whole Scripture if the filter string is empty

Return type:

pandas.DataFrame
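The filter format ‘<book> <chapter>:<verse>[-<verse2>]’ can be parsed with a regular expression. The sketch below is one possible reading of that format. `parse_filter` is a hypothetical helper, and the choices to return `None` for an empty filter and to raise `ValueError` for malformed input are assumptions rather than documented behavior.

```python
import re

# <book> <chapter>:<verse>[-<verse2>]
FILTER_RE = re.compile(r"^\s*(\w+)\s+(\d+):(\d+)(?:-(\d+))?\s*$")


def parse_filter(filter_str):
    """Parse a filter string into (book, chapter, verse).

    verse is an int, or a (verse, verse2) tuple when a range is given.
    Hypothetical helper; returns None for an empty filter.
    """
    if not filter_str.strip():
        return None
    m = FILTER_RE.match(filter_str)
    if m is None:
        raise ValueError(f"Unrecognized filter: {filter_str!r}")
    book, chapter, v1, v2 = m.groups()
    verse = (int(v1), int(v2)) if v2 else int(v1)
    return book, int(chapter), verse


single = parse_filter("Gen 1:1")
ranged = parse_filter("Gen 1:1-5")
empty = parse_filter("")
```

The resulting tuple lines up with the `book`, `chapter`, and `verse` parameters of `util.extract`, where `verse` may be an int or a tuple.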

cwordtm.util.get_diction(docs, *, timing=False, code=0)[source]

Determines which is the target language, English or Chinese, in order to build a dictionary of words with their frequencies.

Parameters:: docs (pandas.DataFrame or list) – The collection of documents, default to None
Returns:: The dictionary of words with their frequencies
Return type:: dict

cwordtm.util.get_diction_chi(docs, *, timing=False, code=0)[source]

Tokenizes the collection of Chinese documents and builds a dictionary of words with their frequencies.

Parameters:: docs (pandas.DataFrame or list) – The collection of documents, default to None
Returns:: The dictionary of words with their frequencies
Return type:: dict

cwordtm.util.get_diction_en(docs, *, timing=False, code=0)[source]

Tokenizes the collection of English documents and builds a dictionary of words with their frequencies.

Parameters:: docs (pandas.DataFrame or list) – The collection of text, default to None
Returns:: The dictionary of words with their frequencies
Return type:: dict
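A dictionary of words with their frequencies can be built with `collections.Counter`. The sketch below is illustrative only: `word_frequencies` is a hypothetical name, and it tokenizes on runs of ASCII letters after lowercasing, which may differ from the tokenization cwordtm applies.

```python
import re
from collections import Counter


def word_frequencies(docs):
    """Build a {word: count} dictionary from a list of English strings."""
    counts = Counter()
    for doc in docs:
        # lowercase, then take each run of letters as one token
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return dict(counts)


freq = word_frequencies(["The light shines", "the light"])
```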

cwordtm.util.get_list(df, column='book', *, timing=False, code=0)[source]

Extracts and returns the prescribed column from the Scripture stored in the DataFrame ‘df’.

Parameters:

df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None
column (str, optional) – The column by which the Scripture is grouped, default to ‘book’

Returns:

The grouped Scripture

Return type:

pandas.DataFrame

cwordtm.util.get_sent_terms(text, *, timing=False, code=0)[source]

Determines how to tokenize the input text, based on the global language setting, either English (‘en’) or Traditional Chinese (‘chi’).

Parameters:: text (str) – The input text to be tokenized, default to None
Returns:: The list of tokenized words
Return type:: list

cwordtm.util.get_text(df, text_col='text', *, timing=False, code=0)[source]

Extracts and returns the text stored in the DataFrame ‘df’ after joining the list of text into a string and removing all ideographic spaces (‘　’) from the text.

Parameters:

df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None
text_col (str, optional) – The name of the text column to be extracted, default to ‘text’

Returns:

The extracted text

Return type:

str
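The join-and-strip step can be sketched on a plain list of strings. The ideographic space is the Unicode character U+3000. `join_text` is a hypothetical helper, and the separator used when joining is an assumption, since the reference does not specify one.

```python
IDEOGRAPHIC_SPACE = "\u3000"


def join_text(texts, sep=" "):
    """Join a list of strings and remove every ideographic space (U+3000)."""
    joined = sep.join(texts)
    return joined.replace(IDEOGRAPHIC_SPACE, "")


# Placeholder strings, not actual Scripture text
sample = ["甲\u3000乙", "丙"]
result = join_text(sample, sep="")
```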

cwordtm.util.get_text_list(df, text_col='text', *, timing=False, code=0)[source]

Extracts and returns the list of text stored in the DataFrame ‘df’ after removing all ideographic spaces (‘　’) from the text.

Parameters:

df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None
text_col (str, optional) – The name of the text column to be extracted, default to ‘text’

Returns:

The extracted text

Return type:

list

cwordtm.util.group_text(df, column='chapter', *, timing=False, code=0)[source]

Groups the Bible Scripture in the DataFrame ‘df’ by the prescribed column, and ‘df’ should include columns ‘book’, ‘book_no’, ‘chapter’, ‘verse’, ‘text’, ‘testament’, ‘category’, ‘cat’, and ‘cat_no’.

Parameters:

df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None
column (str, optional) – The column by which the Scripture is grouped, default to ‘chapter’

Returns:

The grouped Scripture

Return type:

pandas.DataFrame
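Grouping by chapter amounts to concatenating the verse text that shares the same book and chapter. The sketch below shows the idea on plain records. `group_by_chapter` is a hypothetical helper, and both the space used to join verses and the columns kept in the result are assumptions that the reference does not document.

```python
def group_by_chapter(rows):
    """Concatenate verse text per (book_no, chapter), preserving input order."""
    grouped = {}
    for row in rows:
        key = (row["book_no"], row["chapter"])
        grouped.setdefault(key, []).append(row["text"])
    return [
        {"book_no": book_no, "chapter": chapter, "text": " ".join(parts)}
        for (book_no, chapter), parts in grouped.items()
    ]


# Placeholder verse text for illustration
rows = [
    {"book_no": 1, "chapter": 1, "verse": 1, "text": "a"},
    {"book_no": 1, "chapter": 1, "verse": 2, "text": "b"},
    {"book_no": 1, "chapter": 2, "verse": 1, "text": "c"},
]
chapters = group_by_chapter(rows)
```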

cwordtm.util.is_chi(*, timing=False, code=0)[source]

Checks whether the Chinese language flag is set.

Returns:: True if the Chinese language flag (chi_flag) is set, False otherwise
Return type:: bool

cwordtm.util.load_csv(file_obj, doc_size=0, info=False, *, timing=False, code=0)[source]

Loads a CSV file with a “text” column.

Parameters:

file_obj (str or io.BytesIO) – The prescribed file path from which the text is loaded, or a BytesIO object from Streamlit’s file_uploader, default to None
doc_size (int, tuple, optional) – The number of documents to be loaded, 0 represents all documents, or the range (tuple) of documents to be processed, default to 0
info (bool, optional) – The flag indicating whether the dataset information is shown, default to False

Returns:

The collection of text with the prescribed number of rows loaded

Return type:

pandas.DataFrame

cwordtm.util.load_text(file_obj, doc_size=0, info=False, *, timing=False, code=0)[source]

Loads and returns the text from the prescribed file path.

Parameters:

file_obj (str or io.BytesIO) – The prescribed file path from which the text is loaded, or a BytesIO object from Streamlit’s file_uploader, default to None
doc_size (int, tuple, optional) – The number of documents to be loaded, 0 represents all documents, or the range (tuple) of documents to be processed, default to 0
info (bool, optional) – The flag indicating whether the dataset information is shown, default to False

Returns:

The collection of text with the prescribed number of rows loaded

Return type:

pandas.DataFrame

cwordtm.util.load_word(ver='web.csv', nr=0, info=False, *, timing=False, code=0)[source]

Loads and returns the text from the prescribed internal file (‘ver’).

Parameters:

ver (str, optional) – The package’s internal Bible text from which the text is loaded, either World English Bible (‘web.csv’) or Chinese Union Version (Traditional) (‘cuv.csv’), default to ‘web.csv’
nr (int, optional) – The number of rows of Scripture to be loaded; 0 represents all rows, default to 0
info (bool, optional) – The flag indicating whether the dataset information is shown, default to False

Returns:

The collection of Scripture with the prescribed number of rows loaded

Return type:

pandas.DataFrame

cwordtm.util.preprocess_text(text, *, timing=False, code=0)[source]

Preprocesses English text by converting it to lower case, removing special characters and digits, removing punctuation, removing stopwords, removing short words, and lemmatizing the text.

Parameters:: text (str) – The text to be preprocessed, default to None
Returns:: The preprocessed text
Return type:: str
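Most of the preprocessing steps listed above can be sketched with the standard library alone. This is not the package's implementation: `preprocess` is a hypothetical name, the stopword set is a tiny sample, the minimum word length of 3 is an assumption, and lemmatization is omitted because it requires an external library.

```python
import re
import string

# Tiny sample stopword set for illustration; a real list is much longer
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "was"}


def preprocess(text, min_len=3):
    """Lowercase, drop digits and punctuation, then drop stopwords and short words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)  # remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    words = [w for w in text.split() if w not in STOPWORDS and len(w) >= min_len]
    return " ".join(words)


cleaned = preprocess("In the beginning, 7 lamps were lit!")
```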

cwordtm.util.remove_noise(text, noise_list, *, timing=False, code=0)[source]

Removes a list of substrings in noise_list from the input text.

Parameters:

text (str) – The input text, default to None
noise_list (list) – The list of substrings to be removed

Returns:

The text with the prescribed substrings removed

Return type:

str
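Removing a list of substrings reduces to repeated `str.replace` calls. The sketch below uses a hypothetical helper, `strip_substrings`, with an invented input string. Substrings are removed in list order, so an earlier removal can affect whether a later entry still matches.

```python
def strip_substrings(text, noise_list):
    """Remove every occurrence of each substring in noise_list from text."""
    for noise in noise_list:
        text = text.replace(noise, "")
    return text


stripped = strip_substrings("[1] In the beginning [note]", ["[1] ", " [note]"])
```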

cwordtm.util.reset_rows(*, timing=False, code=0)[source]: Resets the maximum number of rows of DataFrames to be displayed to its default value.

cwordtm.util.set_lang(lang='en', *, timing=False, code=0)[source]

Sets the prescribed language (English or Chinese (Traditional)) for further text processing.

Parameters:: lang (str, optional) – The prescribed language for text processing, where ‘en’ stands for English or ‘chi’ for Traditional Chinese, default to ‘en’

cwordtm.util.set_rows(n=None, *, timing=False, code=0)[source]

Sets the maximum number of rows of DataFrames to be displayed.

Parameters:: n (int, optional) – The maximum number of rows to be set; None denotes that all rows are to be displayed, default to None

cwordtm.version module

cwordtm.version.__author__ = 'Johnny Cheng': Name of the author of the package

cwordtm.version.__copyright__ = 'Copyright (c) 2025 - Johnny Cheng': Copyright information

cwordtm.version.__credits__ = ['Jehovah, the Lord']: Credit information

cwordtm.version.__docs__ = 'https://cwordtm.readthedocs.io': Package documentation on “Read the Docs” website

cwordtm.version.__email__ = 'drjohnnycheng@gmail.com': Author’s email address

cwordtm.version.__url__ = 'https://github.com/drjohnnycheng/cwordtm.git': GitHub repository for the package

cwordtm.version.__version__ = '0.7.7': Version information

cwordtm.viz module

cwordtm.viz.chi_wordcloud(docs, figsize=(15, 10), bg='white', image=0, web_app=False, *, timing=False, code=0)[source]

Prepares and shows a Chinese wordcloud.

Parameters:

docs (pandas.DataFrame) – The collection of Chinese documents for preparing a wordcloud, default to None
figsize (tuple, optional) – Size (width, height) of word cloud, default to (15, 10)
bg (str, optional) – The background color (name) of the wordcloud, default to ‘white’
image (int or str or BytesIO, optional) – The filename of the prescribed image used as the mask of the wordcloud, or 1/2/3/4 for using an internal image (heart / disc / triangle / arrow), default to 0 (no image mask)
web_app (bool) – The flag indicating the function is initiated from a web application, default to False

Returns:

The wordcloud figure

Return type:

matplotlib.pyplot.figure

cwordtm.viz.plot_cloud(wordcloud, figsize, web_app=False, *, timing=False, code=0)[source]

Plots the prepared ‘wordcloud’.

Parameters:

wordcloud (WordCloud object) – The WordCloud object for plotting, default to None
figsize (tuple) – Size (width, height) of word cloud, default to None
web_app (bool) – The flag indicating the function is initiated from a web application, default to False

Returns:

The wordcloud figure

Return type:

matplotlib.pyplot.figure

cwordtm.viz.show_wordcloud(docs, clean=False, figsize=(12, 8), bg='white', image=0, web_app=False, *, timing=False, code=0)[source]

Prepares and shows a wordcloud.

Parameters:

docs (pandas.DataFrame) – The collection of documents for preparing a wordcloud, default to None
clean (bool, optional) – The flag indicating whether text preprocessing is needed, default to False
figsize (tuple, optional) – Size (width, height) of word cloud, default to (12, 8)
bg (str, optional) – The background color (name) of the wordcloud, default to ‘white’
image (int or str or BytesIO, optional) – The filename of the prescribed image used as the mask of the wordcloud, or 1/2/3/4 for using an internal image (heart / disc / triangle / arrow), default to 0 (no image mask)
web_app (bool) – The flag indicating the function is initiated from a web application, default to False

Returns:

The wordcloud figure

Return type:

matplotlib.pyplot.figure
