Categorizing Chinese News Articles from the Web

2021 Introduction to Machine Learning Final Project Report

I. Introduction

In this project, we explore the machine learning pipeline and three different methods for categorizing Chinese news articles from the web (based on their title and content) into one of the following categories:

  1. 科技 (Technology)
  2. 產經 (Business and Economy)
  3. 娛樂 (Entertainment)
  4. 運動 (Sports)
  5. 社會 (Society)
  6. 政治 (Politics)


Reading news from multiple sources can be cumbersome. I wish to access a single website where I can read articles from all these sources at once. I also want these articles to be labeled by category, so that they are easy to find by selecting a specific category. However, every news website may have its own categorization scheme, for example:


Although the categories are quite specific to each source, this becomes a problem when we keep all of the articles in the same place (e.g. a database), because there may be duplicate labels or labels that are similar enough to belong in the same category. Therefore, I would like to find a general way to categorize news articles given their title and content.

II. Data Collection

A. News Source Selection

Initially, I considered gathering news articles from three or four news sources, including The Liberty Times (自由時報), China Times (中國時報), United Daily (聯合新聞網), and the Central News Agency, CNA (中央通訊社). After some observation, however, I realized that because CNA is a news agency (通訊社) rather than a newspaper, many newspapers carry articles obtained directly from CNA. This would make building a dataset difficult, since the training set and test set could contain identical articles.

An example of identical articles is shown here:

To prevent identical articles from appearing in both the training set and the test set, I decided to obtain news articles from the Central News Agency, CNA (中央通訊社) only. I then use the categories defined on the CNA website as the labels to predict.

B. Data Crawling

To obtain articles, I wrote a Python script that uses BeautifulSoup to scrape news articles from CNA’s website, storing the data in MongoDB.

After selecting the relevant fields, the resulting dataset is as follows:

  • Number of articles in total: 19826
  • Categories: 6 (科技, 產經, 娛樂, 政治, 社會, 運動)
  • Fields:
    • Title
    • Content
    • Category (Target to predict)
  • News date range:
    • February 2021 ~ January 10, 2022

A single article is shown here:

    '_id': {'$oid': '602b59c9df6290789c381d06'},
    'title': '修憲工程春節後啟動 柯建銘:絕不走極端',
    'category': '政治'

The distribution of the categories is shown as follows:


Category Distribution

III. Preprocessing

A. Data Cleaning

We would like to remove data that may mislead the model during learning, for example the reporter byline at the beginning of the article, (中央社記者溫貴香、王揚宇台北16日電), and the editor and date information at the end of the article, (編輯:林克倫)1100216.

This is because 溫貴香 or 林克倫 may always write articles in a certain category (e.g. 政治), which would let the model learn irrelevant information. Since we want the model to generalize and classify articles from other sources (where these reporters and editors do not appear), we remove these words.
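The cleaning step described above can be sketched with two regular expressions. This is only an illustration: the patterns assume that every byline and editor note follows the two examples quoted above, the function name `clean_article` is hypothetical, and the sample body text is made up.

```python
import re

# Assumed byline form: an opening parenthesis, "中央社", free text, "電" and a
# closing parenthesis at the very start of the article (full- or half-width).
BYLINE = re.compile(r"^\s*[（(]中央社[^）)]*電[）)]")

# Assumed trailer form: "(編輯:<name>)" optionally followed by a numeric date
# stamp at the very end of the article.
EDITOR = re.compile(r"[（(]編輯[:：][^）)]*[）)]\s*\d*\s*$")


def clean_article(content: str) -> str:
    """Strip the reporter byline and the editor/date trailer, if present."""
    content = BYLINE.sub("", content)
    content = EDITOR.sub("", content)
    return content.strip()


# Made-up article body wrapped in the byline and trailer quoted in the text.
sample = "(中央社記者溫貴香、王揚宇台北16日電)修憲工程將啟動。(編輯:林克倫)1100216"
print(clean_article(sample))
```

Articles without a byline or trailer pass through unchanged, so the function can be applied to every row of the dataset.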

The resulting dataset looks as follows:


The category field is what we want to predict.

B. Text Segmentation and Tokenization

Chinese and English differ in that English words are naturally separated by spaces, whereas Chinese text has to be segmented into words explicitly.

To do so, we make use of the tool Jieba (中文分詞), which performs text segmentation on Chinese text.

Before performing segmentation, we concatenate the title and content so that each article forms a single piece of text. Then, as we separate the words, we remove punctuation to make the data even cleaner.

The result looks as follows:

| Before | After Segmentation |
| --- | --- |
| 李登輝逝世週年 日台協會設文庫專區追思 前總統李登輝逝世滿週年,日本… | 李登輝 逝世 週年 日台 協會 設 文庫 專區 追思 前總統 李登輝 逝世 滿週年 日本… |
| 男子捷運月台性騷擾女乘客 北院判拘役40天 一名楊姓男子去年8月間2度… | 男子 捷運 月台 性騷擾 女 乘客 北院 判 拘役 天 一名 楊姓 男子 去年 月間 度… |

(I have also removed numbers on purpose, since later processing stages would otherwise take them into account.)
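A minimal sketch of this segmentation step is given below. It assumes Jieba's `jieba.lcut` as the default segmenter; the helper name `segment`, the `cut` parameter, and the exact character filter (keep CJK ideographs and ASCII letters, drop everything else) are illustrative assumptions rather than the project's actual code.

```python
import re

# Assumption: keep runs of CJK ideographs and ASCII letters; punctuation,
# digits and whitespace are dropped.
_KEEP = re.compile(r"[\u4e00-\u9fffA-Za-z]+")


def segment(title, content, cut=None):
    """Concatenate title and content, segment, and return space-joined tokens."""
    if cut is None:
        import jieba  # third-party package, imported only when needed
        cut = jieba.lcut
    tokens = []
    for token in cut(f"{title} {content}"):
        # Keep only the wanted characters of each token; tokens made purely
        # of punctuation or digits disappear here.
        tokens.extend(_KEEP.findall(token))
    return " ".join(tokens)
```

Passing a different callable as `cut` swaps the segmenter, for example `cut=list` for a character-level split when Jieba is not installed.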

C. Splitting the Train, Validation, Test set

Since the TF-IDF stage that follows computes its statistics over the whole dataset it is given (so running it first would leak information from the test data), we split the dataset into training, validation, and test sets here.

We first split the dataset into a training set and a test set with a ratio of 7:3. The test set will be used to evaluate the performance of the three models later.

Then, we split the training set further using holdout validation with a ratio of 7:3, where the new training set holds 7/10 of the original training set and the validation set holds 3/10.
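The two-stage split can be sketched with scikit-learn's `train_test_split`. The function name `split_dataset` and the fixed `seed` are assumptions added for reproducibility; the source does not say whether the split was seeded or stratified.

```python
from sklearn.model_selection import train_test_split


def split_dataset(docs, labels, seed=42):
    """Split into train/validation/test with two consecutive 7:3 splits."""
    # First split: 70% training (to be split again), 30% test.
    docs_rest, docs_test, y_rest, y_test = train_test_split(
        docs, labels, test_size=0.3, random_state=seed
    )
    # Second split: 70% of the remainder for training, 30% for validation.
    docs_train, docs_val, y_train, y_val = train_test_split(
        docs_rest, y_rest, test_size=0.3, random_state=seed
    )
    return (docs_train, y_train), (docs_val, y_val), (docs_test, y_test)
```

With these ratios the final training, validation, and test sets hold roughly 49%, 21%, and 30% of the articles (0.7 × 0.7, 0.7 × 0.3, and 0.3).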

D. TF-IDF term weighting

After separating each piece of text into words, we want to encode the words of an article as the features of a document (a single row in the dataframe).

Here, we use the term weighting scheme TF-IDF (Term Frequency-Inverse Document Frequency) to encode our text. In short, TF-IDF gives a higher weight to a word that appears frequently in one document (a high term frequency, tf) but rarely in the other documents (a high inverse document frequency, idf); such a word can be thought of as more important. We then choose the 10000 words with the highest frequencies as the features.

scikit-learn provides the class TfidfVectorizer, which counts the word frequencies and encodes each article as a vector of 10000 features. Calling it therefore maps each document in the training set to 10000 features.
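A small sketch of this encoding step follows. `max_features=10000` mirrors the feature count in the text. The `token_pattern` is an assumption: the default pattern of `TfidfVectorizer` drops single-character tokens such as 判, and the source does not say how this was handled. The three toy documents are made up from the segmentation examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy, already-segmented documents (tokens separated by spaces).
train_docs = ["李登輝 逝世 週年 追思", "男子 捷運 月台 判 拘役"]
val_docs = ["男子 追思 乘客"]

# Assumption: treat every whitespace-separated token as a term, so that
# single-character words are kept.
vectorizer = TfidfVectorizer(max_features=10000, token_pattern=r"(?u)\S+")

# The vocabulary and idf weights are learned from the training set only.
X_train = vectorizer.fit_transform(train_docs)

# Other splits reuse the fitted vocabulary; unseen words are ignored.
X_val = vectorizer.transform(val_docs)

print(X_train.shape, X_val.shape)
```

Both matrices have one column per vocabulary term, so a model trained on `X_train` can be applied directly to `X_val`.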

In order to make sure that the mapping makes sense, we map the 10000 features to 2 dimensions using the t-SNE method, and plot the training set and its classes on it:


We can see that each category has a rough boundary that can be distinguished.
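A 2-D projection of this kind can be sketched as follows. The source does not say how the sparse TF-IDF matrix was passed to t-SNE; reducing it with `TruncatedSVD` first is an assumption made here because it is a common way to handle sparse, high-dimensional input. The function name `embed_2d`, its parameters, and the random stand-in matrix are all illustrative.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE


def embed_2d(X, svd_components=50, perplexity=30.0, seed=0):
    """Project a (sparse) feature matrix to 2 dimensions for plotting."""
    # Assumed intermediate step: dense low-rank reduction before t-SNE.
    n_components = min(svd_components, X.shape[1] - 1)
    reduced = TruncatedSVD(n_components=n_components, random_state=seed).fit_transform(X)
    tsne = TSNE(n_components=2, perplexity=perplexity, init="random", random_state=seed)
    return tsne.fit_transform(reduced)


# Random sparse matrix standing in for the TF-IDF features of 40 documents.
rng = np.random.default_rng(0)
dense = rng.random((40, 200)) * (rng.random((40, 200)) < 0.05)
X_demo = sparse.csr_matrix(dense)

points = embed_2d(X_demo, svd_components=10, perplexity=5.0)
print(points.shape)
```

The two columns of `points` can then be drawn as a scatter plot, with one color per category.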

E. Over-sampling the minority classes

Since the categories are quite unbalanced, we over-sample the minority classes to reduce misclassification of those classes.

Originally, the training set contains the following number of documents in each category:


We over-sample the two classes 科技 and 娛樂 to 1000 samples each.


(In our experiments, oversampling improved the recall score of the 科技 category in the Naive Bayes model by more than 50 percentage points, which shows that an imbalanced dataset can affect performance a lot.)
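The source does not state which over-sampling method was used. As one possibility, the sketch below performs plain random over-sampling with replacement; the function name `oversample` and its `targets` argument are hypothetical.

```python
import numpy as np


def oversample(X, y, targets, seed=0):
    """Randomly duplicate rows so each class in `targets` reaches its target size.

    X may be a NumPy array or a SciPy CSR matrix; y is a sequence of labels;
    targets maps a label to the desired number of samples for that class.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    indices = np.arange(len(y))
    extra = []
    for label, n_target in targets.items():
        members = indices[y == label]
        n_missing = n_target - len(members)
        if len(members) > 0 and n_missing > 0:
            # Draw the missing rows from the existing ones, with replacement.
            extra.append(rng.choice(members, size=n_missing, replace=True))
    all_indices = np.concatenate([indices, *extra])
    return X[all_indices], y[all_indices]
```

For the setting in the text, the call would look like `oversample(X_train, y_train, {"科技": 1000, "娛樂": 1000})`, applied to the training set only so that validation and test data stay untouched.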

IV. Models

We choose three of the most common models for text classification:

  • Logistic Regression
  • Naive Bayes
  • Artificial Neural Networks (Multilayer Perceptron)

For all models, we use the off-the-shelf implementations provided by the scikit-learn package and train them on the encoded and over-sampled training set above.

Note that the validation set is encoded with the vectorizer fitted on the training set only (by TfidfVectorizer.transform), so that the words in the validation set are not taken into account when counting frequencies for TF-IDF.
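The train-then-validate routine shared by the models can be sketched as below. The helper name `fit_and_validate` is hypothetical, and the tiny hand-made matrices only stand in for the TF-IDF features; the two classifiers are used with their default settings.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.naive_bayes import MultinomialNB


def fit_and_validate(model, X_train, y_train, X_val, y_val):
    """Fit on the training set; return validation accuracy and per-class recall."""
    model.fit(X_train, y_train)
    predicted = model.predict(X_val)
    labels = sorted(set(y_train))
    recalls = recall_score(y_val, predicted, labels=labels, average=None, zero_division=0)
    return accuracy_score(y_val, predicted), dict(zip(labels, recalls))


# Tiny hand-made matrices standing in for the encoded articles.
X_train = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
y_train = ["政治", "政治", "運動", "運動"]
X_val = [[1.0, 0.0], [0.0, 1.0]]
y_val = ["政治", "運動"]

for model in (LogisticRegression(), MultinomialNB()):
    accuracy, recall_per_class = fit_and_validate(model, X_train, y_train, X_val, y_val)
    print(type(model).__name__, accuracy, recall_per_class)
```

Reporting recall per class alongside accuracy matters here, because a minority class can be badly misclassified while the overall accuracy stays high.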

A. Logistic Regression

For the logistic regression model, we use the module LogisticRegression out of the box and train it on the augmented training set.

Then, we predict on the validation set. The results are shown below:


Validation accuracy: 0.9551


It is worth noting that, before oversampling the minority classes 科技 and 娛樂, their recall scores were 0.53 and 0.89 respectively, which shows that nearly half of the articles belonging to 科技 were assigned to other classes (產經 in this case). Oversampling mitigated this and improved the recall score of the 科技 class by 33 percentage points. The validation accuracy of the model also improved from 94.52% to 95.51%.

B. Naive Bayes Classifier

For the Naive Bayes model, I chose the MultinomialNB module in scikit-learn, since it is one of the most common Naive Bayes models used for text classification.

The prediction results are shown below:


Validation accuracy: 0.9352


As with logistic regression, oversampling greatly improved the recall score of the 科技 category, which rose from 21.48% to 79.09% (an improvement of about 58 percentage points). The overall accuracy also improved from 92.11% to 93.52%.

C. Artificial Neural Networks

Given the results of the previous models, I decided to build only a simple neural network, since simple models already achieve decent results.

The multi-layer perceptron consists of 3 layers: the input layer, the output layer, and one hidden layer with 1024 neurons.

I chose Adam as the optimizer with $\alpha=10^{-5}$. The results are shown below:


Validation accuracy: 0.9537


Surprisingly, oversampling makes almost no difference to the predictions of the neural network.
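The network described in this section could be configured with scikit-learn's `MLPClassifier` roughly as follows. One caveat: in `MLPClassifier`, the parameter named `alpha` is the L2 regularization strength, not the learning rate. The sketch passes $10^{-5}$ as `alpha`, which is an assumption about what the text means; if a learning rate was intended, the corresponding parameter is `learning_rate_init`. The helper name `build_mlp` is hypothetical.

```python
from sklearn.neural_network import MLPClassifier


def build_mlp(hidden_units=1024, alpha=1e-5, seed=0):
    """One hidden layer of `hidden_units` neurons, trained with Adam."""
    return MLPClassifier(
        hidden_layer_sizes=(hidden_units,),  # a single hidden layer
        solver="adam",
        alpha=alpha,  # L2 penalty in scikit-learn (assumed meaning of alpha)
        random_state=seed,
    )
```

The input and output layers are not specified explicitly; scikit-learn sizes them from the feature matrix and the set of labels when `fit` is called.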

V. Results

The results of the final predictions (categorizations) on the testing set of each model are shown below.

Logistic Regression


Accuracy: 0.9568


Multinomial Naive Bayes


Accuracy: 0.9375


Artificial Neural Network


Accuracy: 0.9593


VI. Conclusion

As long as the preprocessing is done correctly and the dataset is correctly labelled, any of the proposed models should do a decent job of classifying the news articles, so I think our goal has been met.

In the following subsections, we look at the model predictions on real-world data, then discuss possible applications of this work.

A. Model Prediction on Real-World Unseen Data: Categorization Examples

Here, we fetch a few news articles as of January 13, 2022 from other news sources, and take a look at what the models will predict.

  1. 自由時報:台北電玩展「手遊」躍升主角!多間家機、一線遊戲公司缺席不參加 01/12/2022

台北電玩展(TGS)往年是台灣玩家必定朝聖的一大盛會,各大品牌都會在活動釋出最新情報、遊戲試玩,雖然受惠於台灣防疫有成,2021、2022 年都能舉辦實體活動,陣容卻仍受到嚴重打擊。 2022 年台北電玩展將於 22 日開跑,今日主辦單位台北電腦公會正式公開活動內容與參展陣容,前一年被許多網友笑稱是「手遊展」,今年趨勢似乎更加明確,同時陣容更受到打擊,不僅 Sony、微軟等家機遊戲無緣展出,包含萬代、SEGA、Ubisoft 等一線遊戲公司也都沒有參加實體活動。…

| Model | Logistic Regression | Naive Bayes | Artificial Neural Network |
| --- | --- | --- | --- |
| Predicted Category | 科技 | 產經 | 科技 |

  2. 自由時報:證交法將修正 大股東持股5%要「全都露」01/12/2022

金管會昨天(11日)宣布,將啟動修正「證券交易法」共有2大修正內容,第1是強化我國公司持股透明化,讓藏鏡人無所遁形、大股東全都露,持股「申報」及「公告」門檻,從原本規定10%修訂為5%。第2是提高裁罰門檻,若證券商等相關機構,未建立落實內稽內控等重大缺失,罰鍰上限也將由現行480萬元拉高到600萬元。 證交法最近一次修正為去年元月27日,而金管會今年要再度啟動修法。…

| Model | Logistic Regression | Naive Bayes | Artificial Neural Network |
| --- | --- | --- | --- |
| Predicted Category | 產經 | 產經 | 產經 |

  3. 中國時報:國民黨中常委提案改黨名 洪秀柱怒批:沒出息 01/13/2022


| Model | Logistic Regression | Naive Bayes | Artificial Neural Network |
| --- | --- | --- | --- |
| Predicted Category | 政治 | 政治 | 政治 |

We can see that most predictions match what we expected the categories to be: 科技 for the first article, 產經 for the second, and 政治 for the third. The only mismatch is the Naive Bayes prediction of 產經 for the first article.

B. Applications

A tool to classify news articles that come from several different sources

As mentioned in the introduction, I would like to find a general way to categorize news articles given their title and content.

The classifiers that we have trained can provide such a general categorization scheme. If we had a database storing news articles from many different sources, we could use any of the models to determine the category of each article.
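As an illustration of this use case, the sketch below labels one incoming article with a fitted vectorizer and model. The function `categorize`, the `segment` callable, and the toy training documents are hypothetical; in practice the vectorizer and model would be the ones trained on the CNA dataset, and `segment` would be the Jieba-based preprocessing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def categorize(title, content, segment, vectorizer, model):
    """Predict the category of one article from any news source."""
    text = segment(f"{title} {content}")      # same preprocessing as in training
    features = vectorizer.transform([text])   # reuse the fitted vocabulary
    return model.predict(features)[0]


# Toy stand-ins for the trained components (already-segmented, made-up text).
train_docs = ["股市 金管會 證券", "股東 證券 修法", "立法院 修憲 國民黨", "國民黨 黨名 中常委"]
train_labels = ["產經", "產經", "政治", "政治"]

vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")
model = MultinomialNB().fit(vectorizer.fit_transform(train_docs), train_labels)

# The input here is already segmented, so `segment` is the identity function.
print(categorize("國民黨 中常委", "提案 改 黨名", lambda text: text, vectorizer, model))
```

Because every source goes through the same function, articles stored in one database end up with labels from a single scheme, regardless of how their original websites categorized them.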