Categorizing Chinese News Articles from the Web

2021 Introduction to Machine Learning Final Project Report

I. Introduction

In this project, we explore the machine learning pipeline and three different methods for categorizing Chinese news articles from the web (based on their title and content) into one of the following categories:

  1. 科技 (Technology)
  2. 產經 (Business and Economy)
  3. 娛樂 (Entertainment)
  4. 運動 (Sports)
  5. 社會 (Society)
  6. 政治 (Politics)

Motivation

Reading news from multiple sources can be cumbersome. I wish to access a single website where I can read articles from all these sources at once. Additionally, I want these articles to be categorically labeled so that they are easy to locate by selecting specific categories. As we know, every news website may have its own categorization scheme, for example:

libertynews
udn
chinatimes

Although the categories are quite specific within each source, this becomes a problem when we keep all of the articles in the same place (e.g. a database): there may be duplicate labels, or labels so similar that they could be merged into the same category. Therefore, I would like to find a generalized way to categorize/classify news articles given their title and content.

II. Data Collection

A. News Source Selection

Initially, I considered gathering news articles from 3 to 4 news sources, including The Liberty Times (自由時報), China Times (中國時報), United Daily News (聯合新聞網), and the Central News Agency (CNA, 中央通訊社). But after some observation, I realized that since CNA is a news agency (通訊社) rather than a newspaper, many newspapers may contain articles obtained directly from CNA, which would make creating a dataset difficult (the training set and test set could end up containing identical data).

An example of identical articles is shown here:

To prevent identical articles from appearing in the test set, I have decided to obtain news articles from the Central News Agency (CNA, 中央通訊社) only. I then use the categories defined on the CNA website as the labels to predict.

B. Data Crawling

To obtain articles, I wrote a Python script that uses BeautifulSoup to scrape news articles from CNA’s website, storing the data in MongoDB.
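The crawler itself is not shown in this report; the sketch below illustrates the general idea on a hypothetical HTML snippet (the real CNA page structure and selectors differ and must be read off the live site, and the MongoDB step requires a running server, so it is left commented out):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a CNA article page; the real
# page structure and CSS selectors differ.
html = """
<html><body>
  <h1>修憲工程春節後啟動 柯建銘:絕不走極端</h1>
  <div class="paragraph"><p>力拚2022憲改公投綁大選,民進黨團總召柯建銘今天表示…</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
doc = {
    "title": soup.find("h1").get_text(strip=True),
    "content": [p.get_text(strip=True) for p in soup.select("div.paragraph p")],
    "category": "政治",  # taken from the category page the article was listed under
}

# Storing the document in MongoDB would then be (requires a running server):
# from pymongo import MongoClient
# MongoClient()["news"]["articles"].insert_one(doc)
print(doc["title"])
```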

After selecting the relevant information, the obtained dataset can be summarized as follows:

  • Number of articles in total: 19826
  • Categories: 6 (科技, 產經, 娛樂, 政治, 社會, 運動)
  • Fields:
    • Title
    • Content
    • Category (Target to predict)
  • News date range:
    • February 2021 ~ January 10, 2022

A single article is shown here:

{
    '_id': {'$oid': '602b59c9df6290789c381d06'},
    'title': '修憲工程春節後啟動 柯建銘:絕不走極端',
    'content':
    [
    '(中央社記者溫貴香、王揚宇台北16日電)力拚2022憲改公投綁大選,民進黨團總召柯建銘今天表示,春節連假過後將啟動修憲工程,會負責任提出黨版修憲案,並強調任何題目都可以談且都會去思考,但民進黨絕不會走極端。',
    '民進黨立法院黨團總召柯建銘受訪表示,修憲應由總統主導,涉及國家政府體制的建立,這是總統無法迴避的責任,總統也不會迴避,要尊重由總統主導。民進黨是執政黨,要主導國會修憲。',
      '他說,2019年12月的美麗島事件40周年、世界人權日,立法院會三讀通過監察院國家人權委員會組織法,監察院下設國家人權委員會,由監察院長擔任主任委員。同時,修正通過考試院組織法部分條文,考試委員名額從19人改為7人至9人,考試院長、副院長及考試委員任期從6年改為4年,與總統任期一致。',
      '柯建銘說,18歲公民權已是普世價值,過去民進黨一直強烈主張,自創黨以來,民進黨中央從來沒有停止過修憲的討論,這次也是一樣;民進黨執政必須尊重由總統主導修憲,包括中央政府體制、五權變三權、不分區席次票票不等值、閣揆同意權等,民進黨一定會提出一個版本。',
      '柯建銘表示,修憲必須要有黨版,哪些問題可能會成功,哪些問題屬於理念闡述,都必須仔細評估,最後一定會有黨版。他強調,修憲提案沒有併大選絕對不會成功,即使併大選要成功闖關,也必須朝野有高度共識,否則會一事無成。',
      '他表示,修憲提案要併2022年直轄市暨縣市長選舉,必須先往回推算,約大選日9個月前要送出修憲提案;因為修憲提案必須先公告半年加上3個月後投票,總計需9個月時間,前置作業又包括何時開公聽會、要拋出什麼議題等,必須先行設定期程。',
      '至於修憲併2022年大選時間是否太趕,柯建銘表示,「不會」,修憲案若不併2022年大選根本是自殺式修憲,下修18歲公民權的修憲案,若做民調並不是壓倒性的勝利,必須各政黨高度動員與合作才有可能超過965萬票,這絕對不是朝野政黨比賽搶功勞、爭功諉過的題目。',
      '柯建銘表示,這次修憲有很多題目,不管是個別立委提案或民間團體倡議,例如環境權、人權入憲、勞動權等,都可以談且都會去思考;另涉及變更領土、國號等敏感性議題,民進黨不會走極端,因為面對世界大變局,兩岸局勢多變,不能有任何挑釁。',
      '他表示,965萬票是高門檻,修憲不能變成政治舞台表演,把修憲意義破壞掉;2月16日春節過後要開始進行,包括設定議題、舉辦公聽會,民進黨有一定的步驟。',
      '民進黨立法院黨團修憲小組由柯建銘親自領軍並設有雙召委機制,由資深立委管碧玲、具法律專業背景立委周春米,在總統主導下,府院黨加上黨團協力共同推動;總統府則由副秘書長李俊俋、民進黨秘書長林錫耀、行政院政務委員羅秉成,加上立法院黨團三長,最後提出黨版修憲案。',
      '柯建銘表示,2022年若18歲公民權修憲案沒有通過,這說不過去,這是「我們這一代政治人物的責任」,因為過關是高門檻,朝野必須高度動員。',
      '憲法增修條文第12條規定,「憲法之修改,須經立法院立法委員1/4之提議,3/4之出席,及出席委員3/4之決議,提出憲法修正案,並於公告半年後,經中華民國自由地區選舉人投票複決,有效同意票過選舉人總額之半數,即通過之。」換句話說,以2020年總統大選的選舉人數換算,要超過965萬票才算通過,外界視為超高門檻。(編輯:林克倫)1100216'
    ],
    'category': '政治'
}

The distribution of the categories is shown as follows:

cat-distribution

Category Distribution

III. Preprocessing

A. Data Cleaning

We would like to remove text that may mislead our model during learning, for example the reporter byline at the beginning of an article, (中央社記者溫貴香、王揚宇台北16日電), and the editor and date information at the end, (編輯:林克倫)1100216.

This is because it is possible that 溫貴香 or 林克倫 always writes articles in a certain category (e.g. 政治), which would let the model learn irrelevant information. Since we want our model to be general and able to classify articles from other sources (which these reporters and editors do not write for), we remove those tokens.
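This cleaning step can be sketched with two regular expressions; the exact patterns below are my assumption of how the byline and editor stamp could be matched (full-width parentheses and colon, as they appear in CNA articles), not the report's actual code:

```python
import re

# Assumed patterns: byline at the start of the first paragraph,
# editor/date stamp at the end of the last one.
BYLINE = re.compile(r"^[((]中央社記者[^))]*電[))]")
EDITOR = re.compile(r"[((]編輯[::][^))]*[))]\d*$")

def clean_paragraphs(paragraphs):
    """Strip the reporter byline from the first paragraph and the
    editor/date information from the last paragraph."""
    cleaned = list(paragraphs)
    cleaned[0] = BYLINE.sub("", cleaned[0])
    cleaned[-1] = EDITOR.sub("", cleaned[-1])
    return cleaned

print(clean_paragraphs([
    "(中央社記者溫貴香、王揚宇台北16日電)力拚2022憲改公投綁大選",
    "外界視為超高門檻。(編輯:林克倫)1100216",
]))
```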

The resulting dataset looks as follows:

example

The category field is what we want to predict.

B. Text Segmentation and Tokenization

Chinese and English differ in that English is naturally segmented by spaces, whereas Chinese text must be explicitly segmented into words.

To do so, we make use of the tool Jieba 中文分詞 , which performs text segmentation on Chinese text.

Before performing segmentation, we concatenate the title and content so that the text forms a single corpus. Then, as we segment the text, we remove punctuation to make the data even cleaner.

The result looks as follows:

| Before Segmentation | After Segmentation |
| --- | --- |
| 李登輝逝世週年 日台協會設文庫專區追思 前總統李登輝逝世滿週年,日本… | 李登輝 逝世 週年 日台 協會 設 文庫 專區 追思 前總統 李登輝 逝世 滿週年 日本… |
| 男子捷運月台性騷擾女乘客 北院判拘役40天 一名楊姓男子去年8月間2度… | 男子 捷運 月台 性騷擾 女 乘客 北院 判 拘役 天 一名 楊姓 男子 去年 月間 度… |

(I’ve also removed numbers on purpose since further processing may take numbers into account.)

C. Splitting the Train, Validation, Test set

Since the subsequent TF-IDF preprocessing stage takes the whole dataset into consideration (performing it first would let us see the test data), we split the dataset into training, validation, and test sets at this point.

We first split the dataset into a training set and a test set in a 7:3 ratio; the test set will be used to evaluate the performance of the three models later.

Then, we split the training set further using holdout validation with a 7:3 ratio, where the new training set holds 7/10 of the original training set and the validation set holds 3/10.
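The two splits can be sketched with scikit-learn's train_test_split; the report does not say whether the splits were stratified, so the `stratify` arguments below are an assumption, and the documents/labels are toy stand-ins:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the segmented documents and their category labels.
docs = [f"doc {i}" for i in range(100)]
labels = [i % 6 for i in range(100)]

# First split: 7:3 into (training + validation) vs. test.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    docs, labels, test_size=0.3, random_state=42, stratify=labels)

# Second split: holdout validation, again 7:3, on the training portion.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.3, random_state=42, stratify=y_trainval)

print(len(X_train), len(X_val), len(X_test))  # → 49 21 30
```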

D. TF-IDF term weighting

After segmenting a corpus into words, we want to encode the words of an article into the features of a document (a single row in the dataframe).

Here, we use the TF-IDF (Term Frequency-Inverse Document Frequency) term-weighting scheme to encode our text. In short, TF-IDF gives a higher weight to a word that appears frequently in one document (a high term frequency, tf) but not as frequently in all other documents (a high inverse document frequency, idf); such words can be thought of as more important. Then, we choose the top 10,000 words with the highest frequencies as the features.

Scikit-learn provides the TfidfVectorizer class, which counts the word frequencies and then encodes each article into a vector of 10,000 features for us. Thus, it maps each document in the training set to 10,000 features.
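A minimal sketch of this encoding step follows; the vectorizer settings are an assumption, since the report does not show them. One subtlety worth a comment: scikit-learn's default token pattern drops single-character tokens, which would silently discard one-character Chinese words such as 設 or 判, so a whitespace-based pattern is used here for the pre-segmented text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Jieba output is already space-separated, so splitting on whitespace
# suffices; the default token pattern would drop single-character words.
train_docs = [
    "李登輝 逝世 週年 日台 協會 設 文庫 專區 追思",
    "男子 捷運 月台 性騷擾 女 乘客 北院 判 拘役",
]

vectorizer = TfidfVectorizer(max_features=10000, token_pattern=r"(?u)\S+")
X_train = vectorizer.fit_transform(train_docs)   # fit on the training set only

# Validation/test documents are later encoded with .transform(), so their
# words never influence the learned vocabulary or IDF statistics.
X_val = vectorizer.transform(["前總統 李登輝 逝世 滿週年"])
print(X_train.shape, X_val.shape)
```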

In order to make sure that the mapping makes sense, we project the 10,000 features down to 2 dimensions using the t-SNE method, and plot the training set with its classes:

t-sne

We can see that each category forms a cluster with a roughly distinguishable boundary.
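The projection can be sketched as follows, with random data standing in for the real TF-IDF matrix (the plotting itself is omitted; each row of the embedding is a 2-D point that can be scatter-plotted and colored by category):

```python
import numpy as np
from sklearn.manifold import TSNE

# Random data standing in for the (documents × 10000 features) TF-IDF matrix.
rng = np.random.default_rng(0)
X = rng.random((60, 50))

# Project the high-dimensional features down to 2-D for visualization.
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(emb.shape)  # → (60, 2)
```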

E. Over-sampling the minority classes

Since the categories are quite imbalanced, we over-sample the minority classes to prevent them from being mispredicted.

Originally, the training set contains the following number of documents per category:

| Category | Count |
| --- | --- |
| 政治 | 2717 |
| 產經 | 2306 |
| 社會 | 2029 |
| 運動 | 1949 |
| 娛樂 | 470 |
| 科技 | 243 |

We over-sample the two classes 科技 and 娛樂 to 1000 samples.

| Category | Count |
| --- | --- |
| 政治 | 2717 |
| 產經 | 2306 |
| 社會 | 2029 |
| 運動 | 1949 |
| 娛樂 | 1000 |
| 科技 | 1000 |

(After some experiments, oversampling improved the recall score for the 科技 category in the Naive Bayes model by up to about 50 percentage points, which shows that an imbalanced dataset can affect performance a lot.)
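The report does not say which oversampling tool was used; one simple way to duplicate minority samples, sketched here with scikit-learn's `resample` on toy data:

```python
from collections import Counter
from sklearn.utils import resample

# Toy (document, label) rows with one heavily under-represented class.
rows = [("doc", "政治")] * 50 + [("doc", "科技")] * 5

def oversample(rows, label, target):
    """Randomly duplicate rows of `label` (sampling with replacement)
    until that class reaches `target` rows in total."""
    minority = [r for r in rows if r[1] == label]
    extra = resample(minority, replace=True,
                     n_samples=target - len(minority), random_state=42)
    return rows + extra

rows = oversample(rows, "科技", 20)
print(Counter(r[1] for r in rows))  # 科技 now has 20 rows
```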

IV. Models

We choose the three most common models to perform text classification:

  • Logistic Regression
  • Naive Bayes
  • Artificial Neural Networks (Multilayer Perceptron)

For all models, we use the off-the-shelf implementations provided by the scikit-learn package and train them on the encoded and upsampled training set described above.

Note that the validation set is encoded separately from the training set (via TfidfVectorizer.transform), so that the words in the validation set are not taken into consideration when counting TF-IDF frequencies.

A. Logistic Regression

For the logistic regression model, we use the module LogisticRegression out of the box and train it on the augmented training set.
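A minimal sketch of this step, with tiny hypothetical pre-segmented documents standing in for the real TF-IDF-encoded training set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical pre-segmented documents and their labels.
train_docs = ["修憲 政黨 立委 選舉", "球隊 比賽 投手 冠軍",
              "修憲 立委 公投", "球員 比賽 教練"]
train_labels = ["政治", "運動", "政治", "運動"]

vec = TfidfVectorizer(token_pattern=r"(?u)\S+")
X = vec.fit_transform(train_docs)

# Out-of-the-box logistic regression, trained on the encoded documents.
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
print(clf.predict(vec.transform(["立委 修憲 政黨"])))  # → ['政治']
```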

Then, we predict on the validation set; the obtained results are shown below:

logistic-regression

Validation accuracy: 0.9551

| Category | Precision | Recall |
| --- | --- | --- |
| 政治 | 0.9499 | 0.9371 |
| 產經 | 0.9400 | 0.9483 |
| 社會 | 0.9614 | 0.9724 |
| 運動 | 0.9806 | 0.9829 |
| 娛樂 | 0.9512 | 0.9466 |
| 科技 | 0.8879 | 0.8636 |

One thing worth noting is that, before oversampling the minority classes 科技 and 娛樂, their recall scores were 0.53 and 0.89 respectively, which shows that most of the articles belonging to 科技 were being categorized into other classes (產經 in this case). Oversampling prevented this situation and yielded a 33-percentage-point improvement in the recall score of the 科技 class. The model's accuracy on the validation set also improved from 94.52% to 95.51%.

B. Naive Bayes Classifier

For the Naive Bayes model, I chose the MultinomialNB module in scikit-learn, since it is one of the most common naive Bayes models used for text classification.
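Swapping in the Naive Bayes model is essentially a one-line change; a sketch with the same kind of toy pre-segmented documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["修憲 政黨 立委 選舉", "球隊 比賽 投手 冠軍",
              "修憲 立委 公投", "球員 比賽 教練"]
train_labels = ["政治", "運動", "政治", "運動"]

vec = TfidfVectorizer(token_pattern=r"(?u)\S+")
X = vec.fit_transform(train_docs)

# MultinomialNB accepts the non-negative TF-IDF weights directly.
nb = MultinomialNB().fit(X, train_labels)
print(nb.predict(vec.transform(["投手 比賽 球隊"])))  # → ['運動']
```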

The prediction results are shown below:

multinomialNB

Validation accuracy: 0.9352

| Category | Precision | Recall |
| --- | --- | --- |
| 政治 | 0.9039 | 0.9229 |
| 產經 | 0.9211 | 0.9252 |
| 社會 | 0.9519 | 0.9552 |
| 運動 | 0.9815 | 0.9680 |
| 娛樂 | 0.9208 | 0.9029 |
| 科技 | 0.9255 | 0.7909 |

Similar to logistic regression, oversampling also showed a huge improvement in the recall score of the 科技 category, which rose from 21.48% to 79.09% (a 58-percentage-point improvement). The overall accuracy also improved from 92.11% to 93.52%.

C. Artificial Neural Networks

Given the results of the previous models, I decided to create just a simple neural network, since a decent result could already be obtained with simple models.

The multi-layer perceptron consists of 3 layers: the input layer, the output layer, and one hidden layer of 1024 neurons.

I have chosen adam as the optimizer with $\alpha=10^{-5}$; the results are shown below:

nn

Validation accuracy: 0.9537

| Category | Precision | Recall |
| --- | --- | --- |
| 政治 | 0.9524 | 0.9388 |
| 產經 | 0.9270 | 0.9505 |
| 社會 | 0.9679 | 0.9679 |
| 運動 | 0.9818 | 0.9874 |
| 娛樂 | 0.9420 | 0.9466 |
| 科技 | 0.8673 | 0.7727 |

Surprisingly, oversampling shows almost no difference when predicting using the neural network.
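A sketch of the network described above, on toy data; note that scikit-learn's MLPClassifier `alpha` parameter is the L2 penalty term, which is assumed here to be what the report's $\alpha$ refers to (it could also mean the learning rate, `learning_rate_init`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

train_docs = ["修憲 政黨 立委 選舉", "球隊 比賽 投手 冠軍",
              "修憲 立委 公投", "球員 比賽 教練"]
train_labels = ["政治", "運動", "政治", "運動"]

vec = TfidfVectorizer(token_pattern=r"(?u)\S+")
X = vec.fit_transform(train_docs)

# One hidden layer of 1024 neurons, adam optimizer, alpha = 1e-5,
# as described in the report.
mlp = MLPClassifier(hidden_layer_sizes=(1024,), solver="adam", alpha=1e-5,
                    max_iter=500, random_state=0)
mlp.fit(X, train_labels)
print(mlp.predict(vec.transform(["立委 修憲"])))
```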

V. Results

The results of the final predictions (categorizations) on the testing set of each model are shown below.

Logistic Regression

lr-test

Accuracy: 0.9568

| Category | Precision | Recall |
| --- | --- | --- |
| 政治 | 0.9519 | 0.9468 |
| 產經 | 0.9438 | 0.9466 |
| 社會 | 0.9666 | 0.9712 |
| 運動 | 0.9832 | 0.9849 |
| 娛樂 | 0.9580 | 0.9580 |
| 科技 | 0.8523 | 0.8380 |

Multinomial Naive Bayes

multinb-test

Accuracy: 0.9375

| Category | Precision | Recall |
| --- | --- | --- |
| 政治 | 0.9062 | 0.9360 |
| 產經 | 0.9385 | 0.9159 |
| 社會 | 0.9560 | 0.9545 |
| 運動 | 0.9772 | 0.9698 |
| 娛樂 | 0.9368 | 0.9336 |
| 科技 | 0.8343 | 0.7877 |

Artificial Neural Network

nn-test

Accuracy: 0.9593

| Category | Precision | Recall |
| --- | --- | --- |
| 政治 | 0.9532 | 0.9498 |
| 產經 | 0.9384 | 0.9576 |
| 社會 | 0.9720 | 0.9720 |
| 運動 | 0.9865 | 0.9849 |
| 娛樂 | 0.9719 | 0.9685 |
| 科技 | 0.8812 | 0.7877 |

VI. Conclusion

As long as the preprocessing is done correctly and the dataset is correctly labelled, any of the proposed models should do a decent job of classifying the news articles; thus, I think our goal can be satisfied.

In the following subsections, we look at the models' predictions on real-world data, then discuss possible applications of this work.

A. Model Prediction on Real-World Unseen Data: Categorization Examples

Here, we fetch a few news articles as of January 13, 2022 from other news sources, and take a look at what the models will predict.

  1. 自由時報:台北電玩展「手遊」躍升主角!多間家機、一線遊戲公司缺席不參加 01/12/2022

台北電玩展(TGS)往年是台灣玩家必定朝聖的一大盛會,各大品牌都會在活動釋出最新情報、遊戲試玩,雖然受惠於台灣防疫有成,2021、2022 年都能舉辦實體活動,陣容卻仍受到嚴重打擊。 2022 年台北電玩展將於 22 日開跑,今日主辦單位台北電腦公會正式公開活動內容與參展陣容,前一年被許多網友笑稱是「手遊展」,今年趨勢似乎更加明確,同時陣容更受到打擊,不僅 Sony、微軟等家機遊戲無緣展出,包含萬代、SEGA、Ubisoft 等一線遊戲公司也都沒有參加實體活動。…

| Model | Logistic Regression | Naive Bayes | Artificial Neural Network |
| --- | --- | --- | --- |
| Predicted Category | 科技 | 產經 | 科技 |
  2. 自由時報:證交法將修正 大股東持股5%要「全都露」01/12/2022

金管會昨天(11日)宣布,將啟動修正「證券交易法」共有2大修正內容,第1是強化我國公司持股透明化,讓藏鏡人無所遁形、大股東全都露,持股「申報」及「公告」門檻,從原本規定10%修訂為5%。第2是提高裁罰門檻,若證券商等相關機構,未建立落實內稽內控等重大缺失,罰鍰上限也將由現行480萬元拉高到600萬元。 證交法最近一次修正為去年元月27日,而金管會今年要再度啟動修法。…

| Model | Logistic Regression | Naive Bayes | Artificial Neural Network |
| --- | --- | --- | --- |
| Predicted Category | 產經 | 產經 | 產經 |
  3. 中國時報:國民黨中常委提案改黨名 洪秀柱怒批:沒出息 01/13/2022

從四項公投到補選、罷免,國民黨在一個月內經歷了3場挫敗。國民黨有中常委提案,要修改國民黨的黨名為「台灣國民黨」。國民黨前主席洪秀柱今天(13日)對此怒斥,那是一個非常沒有出息的想法跟做法,如果黨內還是有人這樣去做,就離開這個政黨吧。洪秀柱13日受訪表示,民黨是個百年政黨,是創建中華民國的政黨,它的名字就叫中國國民黨…

| Model | Logistic Regression | Naive Bayes | Artificial Neural Network |
| --- | --- | --- | --- |
| Predicted Category | 政治 | 政治 | 政治 |

We can see that most of the results follow what we assumed the articles' categories should be: the first article is 科技, the second 產經, and the third 政治.
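Predictions on fresh, already-segmented text can be made end-to-end with a scikit-learn Pipeline that bundles the fitted vectorizer and classifier; the documents below are toy stand-ins, not the real training data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training documents, one per (hypothetical) category.
train_docs = ["修憲 政黨 立委 選舉", "球隊 比賽 投手 冠軍", "遊戲 手遊 玩家 展覽"]
train_labels = ["政治", "運動", "科技"]

# The pipeline applies the vectorizer's transform and then the classifier,
# so a fresh, segmented article is categorized in a single call.
pipe = make_pipeline(TfidfVectorizer(token_pattern=r"(?u)\S+"),
                     LogisticRegression(max_iter=1000))
pipe.fit(train_docs, train_labels)
print(pipe.predict(["電玩展 手遊 遊戲 玩家"]))  # → ['科技']
```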

B. Applications

A tool to classify news articles that come from several different sources

As I have mentioned in the introduction section, I would like to find out a generalized way to categorize/classify news articles given the title and content.

In this case, the classifiers we have trained provide a more general categorization scheme. Therefore, if we had a database storing news articles from many different sources, we could use any of the models to determine the category of each article.