CWordTM Package
Data Folders
Submodules
cwordtm.meta module
- cwordtm.meta.addin(func, *, timing=False, code=0)[source]
Adds features (timing information and source-code display) to a function at runtime by giving ‘func’ two extra keyword parameters, ‘timing’ and ‘code’. ‘timing’ is a flag indicating whether the execution time of the function is shown; it defaults to False. ‘code’ determines whether the source code of ‘func’ is shown and whether the function is invoked: ‘0’ executes the function without showing its source code, ‘1’ executes the function and then shows its source code, and ‘2’ shows the source code without executing the function; it defaults to 0.
- Parameters:
func (function) – The target function for inserting additional features - timing information and showing code, default to None
- Returns:
The wrapper function
- Return type:
function
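The behaviour described for ‘addin’ can be pictured with a small decorator. The following is only a minimal sketch, not the package's implementation: the names ‘addin_sketch’ and ‘add’ are hypothetical, and printing to standard output is an assumption about how timing and source code are shown.

```python
import functools
import inspect
import time


def addin_sketch(func):
    """Hypothetical stand-in for the wrapper described above."""

    @functools.wraps(func)
    def wrapper(*args, timing=False, code=0, **kwargs):
        result = None
        if code in (0, 1):
            # code 0 or 1: the wrapped function is executed.
            start = time.perf_counter()
            result = func(*args, **kwargs)
            if timing:
                elapsed = time.perf_counter() - start
                print(f"{func.__name__} took {elapsed:.6f} s")
        if code in (1, 2):
            # code 1 or 2: the source code is shown (code 2 skips execution).
            try:
                print(inspect.getsource(func))
            except (OSError, TypeError):
                print(f"<source of {func.__name__} unavailable>")
        return result

    return wrapper


@addin_sketch
def add(a, b):
    return a + b
```

With this sketch, ‘add(1, 2, timing=True)’ returns 3 and prints the elapsed time, while ‘add(1, 2, code=2)’ only prints the source and returns None.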
- cwordtm.meta.addin_all(modname='cwordtm', *, timing=False, code=0)[source]
Applies ‘addin’ function to all functions of all sub-modules of a module at runtime.
- Parameters:
modname (str, optional) – The target module whose functions all receive the additional features, default to ‘cwordtm’
- cwordtm.meta.addin_all_functions(submod, *, timing=False, code=0)[source]
Applies ‘addin’ function to all functions of a module at runtime.
- Parameters:
submod (module) – The target sub-module of which all the functions are inserted additional features, default to None
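Applying a wrapper to every function of a module can be sketched as below. This is an illustration under stated assumptions, not the package's code: ‘apply_to_module_functions’, ‘shout’, and the throwaway module ‘demo_mod’ are hypothetical names, and skipping functions imported from other modules is an assumed design choice.

```python
import inspect
import types


def apply_to_module_functions(module, decorator):
    """Replace each function defined in `module` with decorator(function)."""
    wrapped = []
    for name, obj in inspect.getmembers(module, inspect.isfunction):
        # Skip functions merely imported into the module from elsewhere.
        if obj.__module__ != module.__name__:
            continue
        setattr(module, name, decorator(obj))
        wrapped.append(name)
    return wrapped


def shout(func):
    """Toy decorator used only to make the effect visible."""
    def wrapper(*args, **kwargs):
        return str(func(*args, **kwargs)).upper()
    return wrapper


# Build a small in-memory module so the example is self-contained.
demo = types.ModuleType("demo_mod")
exec("def greet(name):\n    return 'hello ' + name\n", demo.__dict__)
wrapped_names = apply_to_module_functions(demo, shout)
```

Because the replacement is done with ‘setattr’ on the module object, callers that look the function up through the module afterwards get the wrapped version.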
- cwordtm.meta.get_function(mod_name, submodules, func_name, *, timing=False, code=0)[source]
Gets the object of the function ‘func_name’ if it belongs to one of ‘submodules’ of the current top-level module.
- Parameters:
mod_name (str) – The name of the source top-level module, default to None
submodules (list) – The list of names of the sub-modules of the top-level module
func_name (str) – The name of the function to be looked for
- Returns:
The object of the target function, if any, otherwise None
- Return type:
function
- cwordtm.meta.get_module_info(detailed=False, *, timing=False, code=0)[source]
Gets the information of the module ‘cwordtm’.
- Parameters:
detailed (bool, optional) – The flag indicating whether only function signature or detailed source code is shown, default to False
- Returns:
The information of the module ‘cwordtm’
- Return type:
str
- cwordtm.meta.get_submodule_info(submodname, detailed=False, *, timing=False, code=0)[source]
Gets the information of the prescribed submodule of the module ‘cwordtm’.
- Parameters:
submodname (str) – The name of the prescribed submodule, default to None
detailed (bool, optional) – The flag indicating whether only function signature or detailed source code is shown, default to False
- Returns:
The information of the prescribed submodule
- Return type:
str
cwordtm.pivot module
- cwordtm.pivot.pivot(df, value='text', category='category', *, timing=False, code=0)[source]
Returns a pivot table from the DataFrame ‘df’ storing the input documents, grouped by the prescribed column.
- Parameters:
df (pandas.DataFrame) – The DataFrame storing the input documents, default to None
value (str, optional) – The column to be grouped, default to ‘text’
category (str, optional) – The column to be the group-by column, default to ‘category’
- Returns:
The pivot table of the input documents grouped by the prescribed column
- Return type:
pandas.DataFrame
- cwordtm.pivot.stat(df, chi=False, *, timing=False, code=0)[source]
Returns a pivot table from the DataFrame ‘df’ storing the input Scripture, with columns ‘book’, ‘book_no’, ‘chapter’, ‘verse’, ‘text’, ‘testament’, ‘category’, ‘cat’, and ‘cat_no’.
- Parameters:
df (pandas.DataFrame) – The DataFrame storing the input Scripture, default to None
chi (bool, optional) – If the value is True, assume the input text is in Chinese, otherwise, the input text is in English, default to False
- Returns:
The pivot table of the input Scripture grouped by category (‘cat_no’)
- Return type:
pandas.DataFrame
cwordtm.quot module
- cwordtm.quot.extract_quotation(text, quot_marks, *, timing=False, code=0)[source]
Returns the text within a pair of quotation marks.
- Parameters:
text (str) – The target text to be extracted, default to None
quot_marks (list) – A pair of quotation marks, [‘”’, ‘”’] for English text or [’『’, ‘』’] for Chinese text, default to None
- Returns:
The text within a pair of quotation marks, if any, otherwise, an empty string
- Return type:
str
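A minimal sketch of this kind of extraction, assuming that only the first quoted span is wanted and that an unmatched mark yields an empty string (the name ‘extract_quotation_sketch’ is hypothetical and this is not the package's code):

```python
def extract_quotation_sketch(text, quot_marks):
    """Return the text between the first opening mark and the next closing mark."""
    open_mark, close_mark = quot_marks
    start = text.find(open_mark)
    if start == -1:
        return ""  # no opening mark found
    start += len(open_mark)
    end = text.find(close_mark, start)
    if end == -1:
        return ""  # opening mark without a matching closing mark
    return text[start:end]


english = extract_quotation_sketch("He said, “Follow me.” Then they left.", ["“", "”"])
chinese = extract_quotation_sketch("耶穌說:『來跟從我』", ["『", "』"])
```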
- cwordtm.quot.match_text(target, sent_tokens, lang, threshold, n=5, *, timing=False, code=0)[source]
Returns a list of tuples of the cosine similarity measures of the OT verse with target verse and the index of that OT verse in the DataFrame storing the prescribed OT Scripture.
- Parameters:
target (str) – The target verse to be matched, default to None
sent_tokens (str) – The target verse to be matched, default to None
lang (str) – If the value is ‘chi’, the processed language is assumed to be Chinese, otherwise, it is English, default to None
threshold (float) – The threshold value of the cosine similarity measure between the target verse and an OT verse, where the cosine similarity measure of a matched OT verse and the target verse should be greater than this value, default to None
n (int, optional) – The upper bound of the number of matched verses, default to 5
- Returns:
The list of tuples of the cosine similarity measure and the index of the OT verse
- Return type:
list
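The matching rule (keep candidates whose cosine similarity with the target exceeds the threshold, at most ‘n’ of them) can be sketched with plain term counts. This is an assumption-laden illustration: the package may vectorize text differently (for example with TF-IDF), and the names ‘cosine’ and ‘match_sketch’ are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token lists using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def match_sketch(target_tokens, candidates, threshold, n=5):
    """Return up to n (score, index) tuples with score > threshold, best first."""
    scored = [(cosine(target_tokens, c), i) for i, c in enumerate(candidates)]
    hits = [pair for pair in scored if pair[0] > threshold]
    return sorted(hits, reverse=True)[:n]


candidates = [
    ["the", "lord", "is", "my", "shepherd"],
    ["in", "the", "beginning"],
    ["my", "shepherd", "the", "lord"],
]
hits = match_sketch(["the", "lord", "is", "my", "shepherd"], candidates, threshold=0.5)
```

Each returned tuple pairs a similarity score with the position of the candidate, mirroring the (measure, index) shape described above.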
- cwordtm.quot.match_verse(i, ot_list, otdf, df, book, chap, verse, lang, threshold, *, timing=False, code=0)[source]
Returns whether the target NT verse (book, chap, verse) can match a particular verse in the list of OT verses (ot_list), and prints the matched OT verse(s).
- Parameters:
i (int) – The number of matched verses so far, default to None
ot_list (list) – The list of OT verses (str) to be matched, default to None
otdf (pandas.DataFrame) – The DataFrame storing the prescribed OT verses to be matched, default to None
df (pandas.DataFrame) – The DataFrame storing the collection of the target NT verses to be matched, default to None
book (str) – The Bible book short name (3 characters) of the target NT verse to be matched, default to None
chap (int) – The chapter number of the target NT verse to be matched, default to None
verse (int) – The verse number of the target NT verse to be matched, default to None
lang (str) – If the value is ‘chi’, the processed language is assumed to be Chinese, otherwise, it is English, default to None
threshold (float) – The threshold value of the cosine similarity measure between the target verse and an OT verse, where that measure for a successful match should be greater than this value, default to None
- Returns:
True if the target verse matched an OT verse, False otherwise
- Return type:
bool
- cwordtm.quot.show_quot(target, source='ot', lang='en', threshold=0.5, *, timing=False, code=0)[source]
Shows a collection of matched OT verses, if any, based on the prescribed collection of target NT verses and the threshold value.
- Parameters:
target (pandas.DataFrame) – The collection of target NT verses to be matched, default to None
source (str, optional) – The string representing the collection of all or subset of OT verses to be matched, default to ‘ot’
lang (str, optional) – If the value is ‘en’, the processed language is assumed to be English, otherwise, it is Chinese, default to ‘en’
threshold (float, optional) – The threshold value of the cosine similarity measure between the target verse and an OT verse, where that measure for a successful match should be greater than this value, default to 0.5
- Returns:
The list of tuples of the cosine similarity measure and the index of the OT verse
- Return type:
list
- cwordtm.quot.tokenize(sentence, *, timing=False, code=0)[source]
Returns a list of tokens from a Chinese sentence.
- Parameters:
sentence (str) – The target text to be tokenized, default to None
- Returns:
The generator object storing the list of tokens extracted from the sentence
- Return type:
generator
cwordtm.ta module
- cwordtm.ta.get_sent_scores(sentences, diction, sent_len, *, timing=False, code=0) → dict[source]
Returns the dictionary of a list of sentences with their scores computed by their words.
- Parameters:
sentences (list) – The list of sentences for computing their scores, default to None
diction (collections.Counter object) – The dictionary storing the collection of tokenized words with their frequencies
sent_len (int) – The maximum number of words in a sentence to be processed, default to None
- Returns:
The dictionary of sentences with their scores computed from their words
- Return type:
dict
- cwordtm.ta.get_sentences(docs, lang='en', *, timing=False, code=0)[source]
Returns the list of sentences tokenized from the collection of documents (docs).
- Parameters:
docs (pandas.DataFrame) – The input documents storing the Scripture, default to None
lang (str, optional) – If the value is ‘chi’, the processed language is assumed to be Chinese, otherwise, it is English, default to ‘en’
- Returns:
The list of sentences tokenized from the collection of documents
- Return type:
list
- cwordtm.ta.get_summary(sentences, sent_weight, threshold, sent_len, *, timing=False, code=0)[source]
Returns the summary of the collection of sentences.
- Parameters:
sentences (list) – The list of target sentences for summarization, default to None
sent_weight (collections.Counter object) – The dictionary of a list of sentences with their scores computed by their words
threshold (float) – The minimum value of sentence weight for extracting that sentence as part of the final summary, default to None
sent_len (int) – The maximum number of words in a sentence to be processed, default to None
- Returns:
The list of sentences of the extractive summary
- Return type:
list
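The scoring and thresholding steps behind an extractive summary can be sketched as follows. This is not the package's implementation: the function names are hypothetical, whitespace tokenization is a simplification, and treating ‘sent_len’ as a cut-off that skips longer sentences is an assumption about its meaning.

```python
from collections import Counter


def sent_scores_sketch(sentences, diction, sent_len):
    """Score each sentence by the summed frequency of its words."""
    scores = {}
    for sent in sentences:
        words = sent.lower().split()
        if len(words) > sent_len:
            continue  # assumption: overly long sentences are skipped
        scores[sent] = sum(diction[w] for w in words)
    return scores


def summary_sketch(sentences, sent_weight, threshold):
    """Keep sentences, in original order, whose score reaches the threshold."""
    return [s for s in sentences if sent_weight.get(s, 0) >= threshold]


sentences = ["god is love", "love is patient", "the sea was calm"]
diction = Counter(w for s in sentences for w in s.split())
scores = sent_scores_sketch(sentences, diction, sent_len=8)
mean_score = sum(scores.values()) / len(scores)
summary = summary_sketch(sentences, scores, mean_score)
```

Here the mean sentence score serves as the threshold; multiplying it by a factor greater than 1 would make the summary more selective.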
- cwordtm.ta.preprocess_sent(text, *, timing=False, code=0)[source]
Preprocesses English text by tokenizing text into sentences of words, converting text to lower case, removing stopwords, lemmatizing text, and tagging text with Part-of-Speech (POS).
- Parameters:
text (str) – The text to be preprocessed, default to None
- Returns:
The list of preprocessed and tagged sentences (word, pos)
- Return type:
list of tuples (str, str)
- cwordtm.ta.summary_chi(docs, weight=1.5, sent_len=8, *, timing=False, code=0)[source]
Returns an extractive summary of a collection of Chinese sentences.
- Parameters:
docs (pandas.DataFrame or pandas.Series or numpy.ndarray or list) – The collection of target documents for summarization, default to None
weight (float, optional) – The factor to be multiplied to the threshold, which determines the sentences as the summary, default to 1.5
sent_len (int, optional) – The maximum number of words in a sentence to be processed, default to 8
- Returns:
The list of sentences of the extractive summary
- Return type:
list
- cwordtm.ta.summary_en(docs, sent_len=8, *, timing=False, code=0)[source]
Returns an extractive summary of a collection of English sentences.
- Parameters:
docs (pandas.DataFrame or pandas.Series or numpy.ndarray or list or text) – The collection of target documents for summarization, default to None
sent_len (int, optional) – The maximum number of words in a sentence to be processed, default to 8
- Returns:
The list of sentences of the extractive summary
- Return type:
list
cwordtm.tm module
- class cwordtm.tm.BTM(doc_file, num_topics, chi=False, embed=True)[source]
Bases: object
The BTM object for BERTopic modeling.
- Variables:
num_topics (int) – The number of topics to be modeled, default to 10
doc_file (str) – The filename of the text file to be processed
chi (bool) – The flag indicating whether the processed text is in Chinese or not, True stands for Traditional Chinese or False for English
num_topics – The number of topics set for the topic model
docs (pandas.DataFrame or list) – The collection of the original documents to be processed
pro_docs (list) – The collection of documents, in form of list of lists of words after text preprocessing
dictionary (gensim.corpora.Dictionary) – The dictionary of word ids with their tokenized words from preprocessed documents (‘pro_docs’)
corpus (list) – The list of documents, where each document is a list of tuples (word id, word frequency in the particular document)
model (bertopic.BERTopic) – The BERTopic model object
embed (bool) – The flag indicating whether the BERTopic model is trained with the BERT pretrained model
bmodel (transformers.BertModel) – The BERT pretrained model
bt_vectorizer (sklearn.feature_extraction.text.CountVectorizer) – The vectorizer extracted from the BERTopic model for model evaluation
bt_analyzer (functools.partial) – The analyzer extracted from the BERTopic model for model evaluation
cleaned_docs (list) – The list of documents (string) built by grouping the original documents by the topics created from the BERTopic model
too_few (bool) – The flag indicating whether there are too few documents to fit the BERTopic model
figures (list(tuple(matplotlib.pyplot.figure))) – The list of tuples (figure type, figure) of model visualization figures
- fit_chi()[source]
Build the BERTopic model for Chinese text with the created corpus and dictionary.
- load(file)[source]
Loads the stored BERTopic model from the specified file.
- Parameters:
file (str) – The name of the file to be loaded, default to None
- Returns:
The loaded BERTopic model
- Return type:
bertopic._bertopic.BERTopic
- preprocess()[source]
Process the original English documents (cwordtm.tm.BTM.docs) by invoking cwordtm.tm.process_text, and build a dictionary and a corpus from the preprocessed documents for the BERTopic model.
- preprocess_chi()[source]
Process the original Chinese documents (cwordtm.tm.BTM.docs) by tokenizing text, removing stopwords, and building a dictionary and a corpus from the preprocessed documents for the BERTopic model.
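The ‘dictionary’ and ‘corpus’ attributes described for this class follow the usual bag-of-words layout. The class documentation gives the dictionary type as gensim.corpora.Dictionary; the standard-library sketch below only illustrates the shape of the data, and the helper names ‘build_dictionary’ and ‘build_corpus’ are hypothetical.

```python
from collections import Counter


def build_dictionary(pro_docs):
    """Map each distinct token to an integer id in order of first appearance."""
    token2id = {}
    for doc in pro_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def build_corpus(pro_docs, token2id):
    """Convert each document into a sorted list of (word id, frequency) tuples."""
    return [
        sorted((token2id[t], c) for t, c in Counter(doc).items())
        for doc in pro_docs
    ]


# Preprocessed documents: a list of lists of words, as described for 'pro_docs'.
pro_docs = [["light", "dark", "light"], ["dark", "water"]]
token2id = build_dictionary(pro_docs)
corpus = build_corpus(pro_docs, token2id)
```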
- class cwordtm.tm.LDA(doc_file, num_topics, chi=False)[source]
Bases: object
The LDA object for Latent Dirichlet Allocation (LDA) modeling.
- Variables:
num_topics (int) – The number of topics to be modeled, default to 10
doc_file (str) – The filename of the text file to be processed
chi (bool) – The flag indicating whether the processed text is in Chinese or not, True stands for Traditional Chinese or False for English
num_topics – The number of topics set for the topic model
docs (pandas.DataFrame or list) – The collection of the original documents to be processed
pro_docs (list) – The collection of documents, in form of list of lists of words after text preprocessing
dictionary (gensim.corpora.Dictionary) – The dictionary of word ids with their tokenized words from preprocessed documents (‘pro_docs’)
corpus (list) – The list of documents, where each document is a list of tuples (word id, word frequency in the particular document)
model (gensim.models.LdaModel) – The LDA model object
vis_data (pyLDAvis.PreparedData) – The LDA model’s prepared data for visualization
- evaluate()[source]
Computes and outputs the coherence score, perplexity, topic diversity, and topic size distribution.
- load(file)[source]
Loads the stored LDA model from the specified file.
- Parameters:
file (str) – The name of the file to be loaded, default to None
- Returns:
The loaded LDA model
- Return type:
gensim.models.LdaModel
- preprocess()[source]
Process the original English documents (cwordtm.tm.LDA.docs) by invoking cwordtm.tm.process_text, and build a dictionary and a corpus from the preprocessed documents for the LDA model.
- preprocess_chi()[source]
Process the original Chinese documents (cwordtm.tm.LDA.docs) by tokenizing text, removing stopwords, and building a dictionary and a corpus from the preprocessed documents for the LDA model.
- class cwordtm.tm.NMF(doc_file, num_topics, chi=False)[source]
Bases: object
The NMF object for Non-negative Matrix Factorization (NMF) modeling.
- Variables:
num_topics (int) – The number of topics to be modeled, default to 10
doc_file (str) – The filename of the text file to be processed
chi (bool) – The flag indicating whether the processed text is in Chinese or not, True stands for Traditional Chinese or False for English
num_topics – The number of topics set for the topic model
docs (pandas.DataFrame or list) – The collection of the original documents to be processed
pro_docs (list) – The collection of documents, in form of list of lists of words after text preprocessing
dictionary (gensim.corpora.Dictionary) – The dictionary of word ids with their tokenized words from preprocessed documents (‘pro_docs’)
corpus (list) – The list of documents, where each document is a list of tuples (word id, word frequency in the particular document)
model (gensim.models.Nmf) – The NMF model object
figures (list(matplotlib.pyplot.figure)) – The list of model visualization figures
- evaluate()[source]
Computes and outputs the coherence score, topic diversity, and topic size distribution.
- load(file)[source]
Loads the stored NMF model from the specified file.
- Parameters:
file (str) – The name of the file to be loaded, default to None
- Returns:
The loaded NMF model and the loaded dictionary of the NMF’s corpus
- Return type:
gensim.models.Nmf, gensim.corpora.Dictionary
- preprocess()[source]
Process the original English documents (cwordtm.tm.NMF.docs) by invoking cwordtm.tm.process_text, and build a dictionary and a corpus from the preprocessed documents for the NMF model.
- preprocess_chi()[source]
Process the original Chinese documents (cwordtm.tm.NMF.docs) by tokenizing text, removing stopwords, and building a dictionary and a corpus from the preprocessed documents for the NMF model.
- cwordtm.tm.btm_process(doc_file, num_topics=10, source=0, text_col='text', doc_size=0, cat=0, chi=False, group=True, eval=False, web_app=False, *, timing=False, code=0)[source]
Pipelines the BERTopic modeling.
- Parameters:
doc_file (str or io.BytesIO) – The filename of the prescribed text file to be loaded, or a BytesIO object from Streamlit’s file_uploader, default to None
num_topics (int, optional) – The number of topics to be modeled, default to 10
source (int, optional) – The source of the prescribed document file (‘doc_file’), where 0 refers to internal store of the package and 1 to external file, default to 0
text_col (str, optional) – The name of the text column to be extracted, default to ‘text’
doc_size (int, tuple, optional) – The number of documents to be processed, 0 represents all documents, or the range (tuple) of documents to be processed, default to 0
cat (int or str, optional) – The category indicating a subset of the Scripture to be loaded, where 0 stands for the whole Bible, 1 for OT, 2 for NT, or one of the ten categories [‘tor’, ‘oth’, ‘ket’, ‘map’, ‘mip’, ‘gos’, ‘nth’, ‘pau’, ‘epi’, ‘apo’] (See the package’s internal file ‘data/book_cat.csv’), default to 0
chi (bool, optional) – The flag indicating whether the text is processed as Chinese (True) or English (False), default to False
group (bool, optional) – The flag indicating whether the loaded text is grouped by chapter, default to True
eval (bool, optional) – The flag indicating whether the model evaluation results will be shown, default to False
web_app (bool) – The flag indicating the function is initiated from a web application, default to False
- Returns:
The pipelined BTM
- Return type:
cwordtm.tm.BTM object
- cwordtm.tm.lda_process(doc_file, num_topics=10, source=0, text_col='text', doc_size=0, cat=0, chi=False, group=True, eval=False, web_app=False, *, timing=False, code=0)[source]
Pipelines the LDA modeling.
- Parameters:
doc_file (str or io.BytesIO) – The filename of the prescribed text file to be loaded, or a BytesIO object from Streamlit’s file_uploader, default to None
num_topics (int, optional) – The number of topics to be modeled, default to 10
source (int, optional) – The source of the prescribed document file (‘doc_file’), where 0 refers to internal store of the package and 1 to external file, default to 0
text_col (str, optional) – The name of the text column to be extracted, default to ‘text’
doc_size (int, tuple, optional) – The number of documents to be processed, 0 represents all documents, or the range (tuple) of documents to be processed, default to 0
cat (int or str, optional) – The category indicating a subset of the Scripture to be loaded, where 0 stands for the whole Bible, 1 for OT, 2 for NT, or one of the ten categories [‘tor’, ‘oth’, ‘ket’, ‘map’, ‘mip’, ‘gos’, ‘nth’, ‘pau’, ‘epi’, ‘apo’] (See the package’s internal file ‘data/book_cat.csv’), default to 0
chi (bool, optional) – The flag indicating whether the text is processed as Chinese (True) or English (False), default to False
group (bool, optional) – The flag indicating whether the loaded text is grouped by chapter, default to True
eval (bool, optional) – The flag indicating whether the model evaluation results will be shown, default to False
web_app (bool) – The flag indicating the function is initiated from a web application, default to False
- Returns:
The pipelined LDA
- Return type:
cwordtm.tm.LDA object
- cwordtm.tm.load_bible(textfile, cat=0, group=True, *, timing=False, code=0)[source]
Loads and returns the Bible Scripture from the prescribed internal file (‘textfile’).
- Parameters:
textfile (str) – The package’s internal Bible text from which the text is loaded, either World English Bible (‘web.csv’) or Chinese Union Version (Traditional) (‘cuv.csv’), default to None
cat (int or str, optional) – The category indicating a subset of the Scripture to be loaded, where 0 stands for the whole Bible, 1 for OT, 2 for NT, or one of the ten categories [‘tor’, ‘oth’, ‘ket’, ‘map’, ‘mip’, ‘gos’, ‘nth’, ‘pau’, ‘epi’, ‘apo’] (See the package’s internal file ‘data/book_cat.csv’), default to 0
group (bool, optional) – The flag indicating whether the loaded text is grouped by chapter, default to True
- Returns:
The collection of Scripture loaded
- Return type:
pandas.DataFrame
- cwordtm.tm.load_text(textfile, doc_size=0, text_col='text', *, timing=False, code=0)[source]
Loads and returns the list of documents from the prescribed file (‘textfile’).
- Parameters:
textfile (str) – The prescribed text file from which the text is loaded, default to None
doc_size (int, tuple, optional) – The number of documents to be processed, 0 represents all documents, or the range (tuple) of documents to be processed, default to 0
text_col (str, optional) – The name of the text column to be extracted, default to ‘text’
- Returns:
The list of documents loaded
- Return type:
list
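The two forms of ‘doc_size’ (a count or a range) can be pictured with the sketch below. The exact range semantics are not stated in this reference, so the zero-based, half-open interpretation here is purely an assumption, and ‘select_docs’ is a hypothetical helper rather than part of the package.

```python
def select_docs(docs, doc_size=0):
    """Pick documents by count or by (start, end) range."""
    if isinstance(doc_size, tuple):
        start, end = doc_size
        return docs[start:end]  # assumption: half-open, zero-based range
    if doc_size == 0:
        return list(docs)  # 0 means all documents
    return docs[:doc_size]  # otherwise the first doc_size documents


docs = ["a", "b", "c", "d", "e"]
first_two = select_docs(docs, 2)
middle = select_docs(docs, (1, 3))
```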
- cwordtm.tm.nmf_process(doc_file, num_topics=10, source=0, text_col='text', doc_size=0, cat=0, chi=False, group=True, eval=False, web_app=False, *, timing=False, code=0)[source]
Pipelines the NMF modeling.
- Parameters:
doc_file (str or io.BytesIO) – The filename of the prescribed text file to be loaded, or a BytesIO object from Streamlit’s file_uploader, default to None
num_topics (int, optional) – The number of topics to be modeled, default to 10
source (int, optional) – The source of the prescribed document file (‘doc_file’), where 0 refers to internal store of the package and 1 to external file, default to 0
text_col (str, optional) – The name of the text column to be extracted, default to ‘text’
doc_size (int, tuple, optional) – The number of documents to be processed, 0 represents all documents, or the range (tuple) of documents to be processed, default to 0
cat (int or str, optional) – The category indicating a subset of the Scripture to be loaded, where 0 stands for the whole Bible, 1 for OT, 2 for NT, or one of the ten categories [‘tor’, ‘oth’, ‘ket’, ‘map’, ‘mip’, ‘gos’, ‘nth’, ‘pau’, ‘epi’, ‘apo’] (See the package’s internal file ‘data/book_cat.csv’), default to 0
chi (bool, optional) – The flag indicating whether the text is processed as Chinese (True) or English (False), default to False
group (bool, optional) – The flag indicating whether the loaded text is grouped by chapter, default to True
eval (bool, optional) – The flag indicating whether the model evaluation results will be shown, default to False
web_app (bool) – The flag indicating the function is initiated from a web application, default to False
- Returns:
The pipelined NMF
- Return type:
cwordtm.tm.NMF object
- cwordtm.tm.process_text(doc, *, timing=False, code=0)[source]
Processes the English text through tokenization, converting to lower case, removing all digits, stemming, and removing punctuations and stopwords.
- Parameters:
doc (str) – The prescribed text, in form of a string, to be processed, default to None
- Returns:
The list of the processed strings
- Return type:
list
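The cleaning steps listed for ‘process_text’ can be sketched with the standard library alone. This is an illustration, not the package's code: the stopword set is a tiny made-up sample, ‘crude_stem’ is a hypothetical placeholder for a real stemmer, and the order of the steps is an assumption.

```python
import re
import string

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "was"}


def crude_stem(word):
    """Strip a few common suffixes; a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def process_text_sketch(doc):
    """Lower-case, drop digits and punctuation, remove stopwords, then stem."""
    doc = doc.lower()
    doc = re.sub(r"\d+", "", doc)
    doc = doc.translate(str.maketrans("", "", string.punctuation))
    return [crude_stem(w) for w in doc.split() if w not in STOPWORDS]


tokens = process_text_sketch("In 3 days, the waters covered the mountains.")
```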
cwordtm.util module
- cwordtm.util.add_chi_vocab(*, timing=False, code=0)[source]
Loads the Chinese Bible vocabulary from the internal file ‘bible_vocab.txt’, and adds it to the Jieba word list for future tokenization.
- cwordtm.util.bible_cat_info(lang='en', *, timing=False, code=0)[source]
Prints a table of Bible book categories with their books.
- Parameters:
lang (str, optional) – The language of the information to be shown, default to “en”
- Returns:
The table of Bible book categories
- Return type:
pandas.DataFrame
- cwordtm.util.chi_sent_terms(text, *, timing=False, code=0)[source]
Returns the list of Chinese words tokenized from the input text.
- Parameters:
text (str) – The input Chinese text to be tokenized, default to None
- Returns:
The list of Chinese words
- Return type:
list
- cwordtm.util.chi_stops(*, timing=False, code=0)[source]
Loads the common Chinese (Traditional) vocabulary to Jieba for future tokenization, and the Chinese stopwords for future wordcloud plotting.
- Returns:
The list of stopwords for wordcloud plotting
- Return type:
list
- cwordtm.util.clean_sentences(sentences, *, timing=False, code=0)[source]
Cleans the list of sentences by invoking the function preprocess_text.
- Parameters:
sentences (list) – The list of sentences to be cleaned, default to None
- Returns:
The list of cleaned sentences
- Return type:
list
- cwordtm.util.clean_text(df, text_col='text', *, timing=False, code=0)[source]
Cleans the text from the Scripture stored in the DataFrame ‘df’, by removing all digits, replacing newline by a space, removing English stopwords, converting all characters to lower case, and removing all characters except alphanumeric and whitespace.
- Parameters:
df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None
text_col (str, optional) – The name of the text column to be extracted, default to ‘text’
- Returns:
The cleaned text in a DataFrame
- Return type:
pandas.DataFrame
- cwordtm.util.extract(df, testament=-1, category='', book=0, chapter=0, verse=0, *, timing=False, code=0)[source]
Extracts a subset of the Scripture stored in a DataFrame by testament, category, or book/chapter/verse.
- Parameters:
df (pandas.DataFrame) – The collection of the Bible Scripture with columns ‘book’, ‘book_no’, ‘chapter’, ‘verse’, ‘text’, ‘testament’, ‘category’, ‘cat’, and ‘cat_no’, default to None
testament (int, optional) – The prescribed testament to be extracted, -1 stands for no prescription, 0 for OT, or 1 for NT, default to -1
category (str, optional) – The prescribed category to be extracted, and it should be either a full category name or a short name with 3 lower-case letters from a list of 10 categories, default to ‘’
book (str, int, optional) – The prescribed Bible book to be extracted, and it should be either a 3-letter short book name or a book number from 1 to 66, default to 0
chapter (int or tuple, optional) – The prescribed chapter or a tuple indicating the range of chapters of a Bible book to be extracted, default to 0
verse (int or tuple, optional) – The prescribed verse or a tuple indicating the range of verses from a chapter of a Bible book to be extracted, default to 0
- Returns:
The subset of the input Scripture, if any, otherwise, the message ‘No scripture is extracted!’
- Return type:
pandas.DataFrame or str
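The int-or-tuple convention of the ‘chapter’ and ‘verse’ parameters can be sketched on plain Python records instead of a DataFrame. The helper names `matches` and `extract_sketch` are hypothetical, and treating a tuple as an inclusive (start, end) range is an assumption; the documentation above does not state whether the bounds are inclusive.

```python
NO_MATCH = "No scripture is extracted!"  # message quoted in the documentation


def matches(value, spec):
    """True if spec is 0 (no filter), equals value, or is an (assumed inclusive) range."""
    if isinstance(spec, tuple):
        start, end = spec
        return start <= value <= end
    return spec == 0 or value == spec


def extract_sketch(rows, book=0, chapter=0, verse=0):
    """Filter verse records (dicts) by short book name, chapter, and verse."""
    selected = [
        row for row in rows
        if (not book or row["book"] == book)
        and matches(row["chapter"], chapter)
        and matches(row["verse"], verse)
    ]
    return selected if selected else NO_MATCH


# Placeholder records; only the column names follow the documentation.
verses = [
    {"book": "gen", "chapter": 1, "verse": 1, "text": "first verse"},
    {"book": "gen", "chapter": 1, "verse": 2, "text": "second verse"},
    {"book": "gen", "chapter": 2, "verse": 1, "text": "third verse"},
    {"book": "exo", "chapter": 1, "verse": 1, "text": "fourth verse"},
]

print(extract_sketch(verses, book="gen", chapter=1))
print(extract_sketch(verses, book="rev"))
```

Because the documented return type is either a DataFrame or a message string, callers would likely need to check the type of the result before using it.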
- cwordtm.util.extract2(df, filter='', *, timing=False, code=0)[source]
Extracts a subset of the Scripture through a specific filter string by invoking the function ‘util.extract’.
- Parameters:
df (pandas.DataFrame) – The collection of the Bible Scripture, default to None
filter (str, optional) – The prescribed filter string with the format ‘<book> <chapter>:<verse>[-<verse2>]’ for extracting a range of verses in the Scripture, default to ‘’
- Returns:
The prescribed range of verses from the input Scripture, or the whole Scripture if the filter string is empty
- Return type:
pandas.DataFrame
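The filter format ‘<book> <chapter>:<verse>[-<verse2>]’ can be parsed with a regular expression. The sketch below only shows one way to split such a string into the book, chapter, and verse values that ‘util.extract’ accepts; the names `FILTER_RE` and `parse_filter` are hypothetical, and how the package itself parses the string is not shown in this documentation.

```python
import re

# Pattern for '<book> <chapter>:<verse>[-<verse2>]', e.g. 'gen 1:1-5'.
FILTER_RE = re.compile(r"^\s*(\w+)\s+(\d+):(\d+)(?:-(\d+))?\s*$")


def parse_filter(filter_str):
    """Return (book, chapter, verse), where verse is an int or a (start, end) tuple.

    An empty filter string returns None, meaning "no restriction".
    """
    if not filter_str.strip():
        return None
    match = FILTER_RE.match(filter_str)
    if match is None:
        raise ValueError(f"Unrecognized filter: {filter_str!r}")
    book, chapter, verse, verse2 = match.groups()
    verse_spec = int(verse) if verse2 is None else (int(verse), int(verse2))
    return book.lower(), int(chapter), verse_spec


print(parse_filter("Gen 1:1-5"))  # ('gen', 1, (1, 5))
print(parse_filter("jhn 3:16"))   # ('jhn', 3, 16)
```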
- cwordtm.util.get_diction(docs, *, timing=False, code=0)[source]
Determines which is the target language, English or Chinese, in order to build a dictionary of words with their frequencies.
- Parameters:
docs (pandas.DataFrame or list) – The collection of documents, default to None
- Returns:
The dictionary of words with their frequencies
- Return type:
dict
- cwordtm.util.get_diction_chi(docs, *, timing=False, code=0)[source]
Tokenizes the collection of Chinese documents and builds a dictionary of words with their frequencies.
- Parameters:
docs (pandas.DataFrame or list) – The collection of documents, default to None
- Returns:
The dictionary of words with their frequencies
- Return type:
dict
- cwordtm.util.get_diction_en(docs, *, timing=False, code=0)[source]
Tokenizes the collection of English documents and builds a dictionary of words with their frequencies.
- Parameters:
docs (pandas.DataFrame or list) – The collection of text, default to None
- Returns:
The dictionary of words with their frequencies
- Return type:
dict
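A "dictionary of words with their frequencies" of the kind the get_diction functions return can be built with `collections.Counter`. This is a sketch for English text only, assuming a simple regular-expression tokenizer; `build_diction` is a hypothetical name, and the package's own tokenization and cleaning steps may differ.

```python
import re
from collections import Counter


def build_diction(docs):
    """Count word occurrences across a list of English documents."""
    counts = Counter()
    for doc in docs:
        # Assumed tokenizer: runs of letters and apostrophes, lower-cased.
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    return dict(counts)


print(build_diction(["God is love", "Love is patient"]))
# {'god': 1, 'is': 2, 'love': 2, 'patient': 1}
```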
- cwordtm.util.get_list(df, column='book', *, timing=False, code=0)[source]
Extracts and returns the prescribed column from the Scripture stored in the DataFrame ‘df’.
- Parameters:
df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None
column (str, optional) – The column by which the Scripture is grouped, default to ‘book’
- Returns:
The grouped Scripture
- Return type:
pandas.DataFrame
- cwordtm.util.get_sent_terms(text, *, timing=False, code=0)[source]
Determines how to tokenize the input text, based on the global language setting, either English (‘en’) or Traditional Chinese (‘chi’).
- Parameters:
text (str) – The input text to be tokenized, default to None
- Returns:
The list of tokenized words
- Return type:
list
- cwordtm.util.get_text(df, text_col='text', *, timing=False, code=0)[source]
Extracts and returns the text stored in the DataFrame ‘df’ after joining the list of text into a string and removing all the ideographic spaces (U+3000) from the text.
- Parameters:
df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None
text_col (str, optional) – The name of the text column to be extracted, default to ‘text’
- Returns:
The extracted text
- Return type:
str
- cwordtm.util.get_text_list(df, text_col='text', *, timing=False, code=0)[source]
Extracts and returns the list of text stored in the DataFrame ‘df’ after removing all the ideographic spaces (U+3000) from the text.
- Parameters:
df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None
text_col (str, optional) – The name of the text column to be extracted, default to ‘text’
- Returns:
The extracted text
- Return type:
list
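The difference between the list form (get_text_list) and the joined form (get_text) can be sketched on a plain list of strings. The function names below are hypothetical, and joining with a single space is an assumption; the separator used by the package is not stated in this documentation.

```python
IDEOGRAPHIC_SPACE = "\u3000"  # full-width space common in CJK text


def text_list_sketch(texts):
    """Return the texts with every ideographic space removed."""
    return [t.replace(IDEOGRAPHIC_SPACE, "") for t in texts]


def text_sketch(texts, sep=" "):
    """Join the cleaned texts into one string (separator is an assumption)."""
    return sep.join(text_list_sketch(texts))


sample = ["起初\u3000神創造天地", "神說"]
print(text_list_sketch(sample))  # ['起初神創造天地', '神說']
print(text_sketch(sample))       # 起初神創造天地 神說
```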
- cwordtm.util.group_text(df, column='chapter', *, timing=False, code=0)[source]
Groups the Bible Scripture in the DataFrame ‘df’ by the prescribed column, and ‘df’ should include columns ‘book’, ‘book_no’, ‘chapter’, ‘verse’, ‘text’, ‘testament’, ‘category’, ‘cat’, and ‘cat_no’.
- Parameters:
df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None
column (str, optional) – The column by which the Scripture is grouped, default to ‘chapter’
- Returns:
The grouped Scripture
- Return type:
pandas.DataFrame
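Grouping verses by chapter can be pictured as collecting the ‘text’ values that share a book and chapter. The sketch below works on plain dicts rather than a DataFrame, and it assumes that grouped verses are concatenated with a space; neither the name `group_text_sketch` nor that joining rule comes from the package.

```python
def group_text_sketch(rows, column="chapter"):
    """Concatenate verse texts that share the same book (and chapter)."""
    grouped = {}
    for row in rows:
        # Chapter numbers repeat across books, so the book is part of the key.
        key = (row["book"],) if column == "book" else (row["book"], row[column])
        grouped.setdefault(key, []).append(row["text"])
    return {key: " ".join(texts) for key, texts in grouped.items()}


# Placeholder records; only the column names follow the documentation.
rows = [
    {"book": "gen", "chapter": 1, "verse": 1, "text": "a"},
    {"book": "gen", "chapter": 1, "verse": 2, "text": "b"},
    {"book": "gen", "chapter": 2, "verse": 1, "text": "c"},
    {"book": "exo", "chapter": 1, "verse": 1, "text": "d"},
]

print(group_text_sketch(rows))
# {('gen', 1): 'a b', ('gen', 2): 'c', ('exo', 1): 'd'}
```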
- cwordtm.util.is_chi(*, timing=False, code=0)[source]
Checks whether the Chinese language flag is set.
- Returns:
True if the Chinese language flag (chi_flag) is set, False otherwise
- Return type:
bool
- cwordtm.util.load_csv(file_obj, doc_size=0, info=False, *, timing=False, code=0)[source]
Loads a CSV file with a “text” column.
- Parameters:
file_obj (str or io.BytesIO) – The prescribed file path from which the text is loaded, or a BytesIO object from Streamlit’s file_uploader, default to None
doc_size (int, tuple, optional) – The number of documents to be loaded, 0 represents all documents, or the range (tuple) of documents to be processed, default to 0
info (bool, optional) – The flag indicating whether the dataset information is shown, default to False
- Returns:
The collection of text with the prescribed number of rows loaded
- Return type:
pandas.DataFrame
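The ‘doc_size’ convention (0 for all documents, an int for a number of documents, or a tuple for a range) can be sketched with list slicing. The function `select_docs` is hypothetical, and the interpretation of the tuple as a 0-based, end-exclusive (start, end) slice is an assumption that this documentation does not confirm.

```python
def select_docs(docs, doc_size=0):
    """Return all documents (0), the first n (int), or a slice (tuple)."""
    if isinstance(doc_size, tuple):
        start, end = doc_size  # assumed 0-based and end-exclusive
        return docs[start:end]
    if doc_size == 0:
        return docs
    return docs[:doc_size]


docs = ["a", "b", "c", "d", "e"]
print(select_docs(docs))          # ['a', 'b', 'c', 'd', 'e']
print(select_docs(docs, 2))       # ['a', 'b']
print(select_docs(docs, (1, 3)))  # ['b', 'c']
```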
- cwordtm.util.load_text(file_obj, doc_size=0, info=False, *, timing=False, code=0)[source]
Loads and returns the text from the prescribed file path.
- Parameters:
file_obj (str or io.BytesIO) – The prescribed file path from which the text is loaded, or a BytesIO object from Streamlit’s file_uploader, default to None
doc_size (int, tuple, optional) – The number of documents to be loaded, 0 represents all documents, or the range (tuple) of documents to be processed, default to 0
info (bool, optional) – The flag indicating whether the dataset information is shown, default to False
- Returns:
The collection of text with the prescribed number of rows loaded
- Return type:
pandas.DataFrame
- cwordtm.util.load_word(ver='web.csv', nr=0, info=False, *, timing=False, code=0)[source]
Loads and returns the text from the prescribed internal file (‘ver’).
- Parameters:
ver (str, optional) – The package’s internal Bible text from which the text is loaded, either World English Bible (‘web.csv’) or Chinese Union Version (Traditional)(‘cuv.csv’), default to ‘web.csv’
nr (int, optional) – The number of rows of Scripture to be loaded; 0 represents all rows, default to 0
info (bool, optional) – The flag indicating whether the dataset information is shown, default to False
- Returns:
The collection of Scripture with the prescribed number of rows loaded
- Return type:
pandas.DataFrame
- cwordtm.util.preprocess_text(text, *, timing=False, code=0)[source]
Preprocesses English text by converting text to lower case, removing special characters and digits, removing punctuation, removing stopwords, removing short words, and lemmatizing the text.
- Parameters:
text (str) – The text to be preprocessed, default to None
- Returns:
The preprocessed text
- Return type:
str
- cwordtm.util.remove_noise(text, noise_list, *, timing=False, code=0)[source]
Removes a list of substrings in noise_list from the input text.
- Parameters:
text (str) – The input text, default to None
noise_list (list, optional) – The list of substrings to be removed, default to “”
- Returns:
The text with the prescribed substrings removed
- Return type:
str
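Removing a list of substrings amounts to repeated string replacement. The following is a minimal sketch of that idea under the assumption that each substring is matched literally; `remove_noise_sketch` is a hypothetical name and the package's implementation may differ.

```python
def remove_noise_sketch(text, noise_list):
    """Delete every literal occurrence of each substring in noise_list."""
    for noise in noise_list:
        text = text.replace(noise, "")
    return text


print(remove_noise_sketch("Let there be [note] light.", ["[note] "]))
# Let there be light.
```

Substrings are removed in list order, so removing one substring can create or destroy a match for a later one.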
- cwordtm.util.reset_rows(*, timing=False, code=0)[source]
Resets the maximum number of rows of DataFrames to be displayed to its default value.
- cwordtm.util.set_lang(lang='en', *, timing=False, code=0)[source]
Sets the prescribed language (English or Chinese (Traditional)) for further text processing.
- Parameters:
lang (str, optional) – The prescribed language for text processing, where ‘en’ stands for English or ‘chi’ for Traditional Chinese, default to ‘en’
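The relationship between set_lang, is_chi, and get_sent_terms can be pictured as a module-level flag that selects a tokenizer. The sketch below is an illustration only: the `_sketch` function names are hypothetical, and splitting Chinese text into single characters is a stand-in for the Jieba tokenization that the package documents.

```python
_chi_flag = False  # module-level language flag (the docs mention 'chi_flag')


def set_lang_sketch(lang="en"):
    """Record whether Traditional Chinese ('chi') processing is selected."""
    global _chi_flag
    _chi_flag = (lang == "chi")


def is_chi_sketch():
    """Return True if the Chinese language flag is set."""
    return _chi_flag


def get_sent_terms_sketch(text):
    """Tokenize according to the current language flag."""
    if is_chi_sketch():
        # Stand-in: one token per non-space character instead of Jieba.
        return [ch for ch in text if not ch.isspace()]
    return text.lower().split()


set_lang_sketch("chi")
print(get_sent_terms_sketch("神愛世人"))     # ['神', '愛', '世', '人']
set_lang_sketch("en")
print(get_sent_terms_sketch("God is love"))  # ['god', 'is', 'love']
```

A global flag keeps call sites short, at the cost that every tokenizing call depends on the most recent language setting.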
cwordtm.version module
- cwordtm.version.__author__ = 'Johnny Cheng'
Name of the author of the package
- cwordtm.version.__copyright__ = 'Copyright (c) 2025 - Johnny Cheng'
Copyright information
- cwordtm.version.__credits__ = ['Jehovah, the Lord']
Credit information
- cwordtm.version.__docs__ = 'https://cwordtm.readthedocs.io'
Package documentation on “Read the Docs” website
- cwordtm.version.__email__ = 'drjohnnycheng@gmail.com'
Author’s email address
- cwordtm.version.__url__ = 'https://github.com/drjohnnycheng/cwordtm.git'
GitHub repository for the package
- cwordtm.version.__version__ = '0.7.7'
Version information
cwordtm.viz module
- cwordtm.viz.chi_wordcloud(docs, figsize=(15, 10), bg='white', image=0, web_app=False, *, timing=False, code=0)[source]
Prepares and shows a Chinese wordcloud.
- Parameters:
docs (pandas.DataFrame) – The collection of Chinese documents for preparing a wordcloud, default to None
figsize (tuple, optional) – Size (width, height) of word cloud, default to (15, 10)
bg (str, optional) – The background color (name) of the wordcloud, default to ‘white’
image (int or str or BytesIO, optional) – The filename of the prescribed image as the mask of the wordcloud, or 1/2/3/4 for using an internal image (heart / disc / triangle / arrow), default to 0 (No image mask)
web_app (bool) – The flag indicating the function is initiated from a web application, default to False
- Returns:
The wordcloud figure
- Return type:
matplotlib.pyplot.figure
- cwordtm.viz.plot_cloud(wordcloud, figsize, web_app=False, *, timing=False, code=0)[source]
Plots the prepared ‘wordcloud’.
- Parameters:
wordcloud (WordCloud object) – The WordCloud object for plotting, default to None
figsize (tuple) – Size (width, height) of word cloud, default to None
web_app (bool) – The flag indicating the function is initiated from a web application, default to False
- Returns:
The wordcloud figure
- Return type:
matplotlib.pyplot.figure
- cwordtm.viz.show_wordcloud(docs, clean=False, figsize=(12, 8), bg='white', image=0, web_app=False, *, timing=False, code=0)[source]
Prepares and shows a wordcloud.
- Parameters:
docs (pandas.DataFrame) – The collection of documents for preparing a wordcloud, default to None
clean (bool, optional) – The flag indicating whether text preprocessing is needed, default to False
figsize (tuple, optional) – Size (width, height) of word cloud, default to (12, 8)
bg (str, optional) – The background color (name) of the wordcloud, default to ‘white’
image (int or str or BytesIO, optional) – The filename of the prescribed image as the mask of the wordcloud, or 1/2/3/4 for using an internal image (heart / disc / triangle / arrow), default to 0 (No image mask)
web_app (bool) – The flag indicating the function is initiated from a web application, default to False
- Returns:
The wordcloud figure
- Return type:
matplotlib.pyplot.figure