Digital documents, Learning Information Resources and Communication Center: An applied research of semi-supervised learning technology in the Vietnamese text classification field


THE MINISTRY OF EDUCATION AND TRAINING
THE UNIVERSITY OF DANANG

VO DUY THANH

AN APPLIED RESEARCH OF SEMI-SUPERVISED LEARNING TECHNOLOGY IN VIETNAMESE TEXT CLASSIFICATION FIELD

Major: COMPUTER SCIENCE
Code: 62 48 01 01

SUMMARY OF DISSERTATION FOR DOCTOR OF ENGINEERING

Da Nang - 2017


THE RESEARCH WAS ACCOMPLISHED AT THE UNIVERSITY OF DANANG

Advisors:

1. Assoc. Prof. Dr Vo Trung Hung
2. Assoc. Prof. Dr Doan Van Ban

Reviewer 1: Prof. Dr Nguyen Mau Han
Reviewer 2: Prof. Dr Phan Huy Khanh
Reviewer 3: Prof. Dr Huynh Thi Thanh Binh

The dissertation was defended before the university-level Dissertation Grading Council at The University of Danang on September 29, 2017.

You can find the dissertation at:

- National Library of Vietnam;

- Learning Information Center, The University of Danang.


INTRODUCTION

1. Reasons for choosing the topic

The rapid development of science and information technology has given people many ways to access information quickly and conveniently, such as electronic libraries, electronic portals, and search applications. These tools make it easier to exchange, update, and search for information all over the world through the Internet.

Therefore, automatic document classification is now considered an urgent problem, and it attracts many researchers. In this dissertation, the author investigates new methods, based on semi-supervised learning technology, for classifying Vietnamese text more effectively.

2. Literature review

In computer science, semi-supervised learning is a class of machine learning techniques that use both labeled and unlabeled data for training. Labeled data is usually scarcer than unlabeled data because labeling requires a lot of time. Many machine learning researchers have found that combining unlabeled data with a small quantity of labeled data can considerably improve learning accuracy.
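To make the idea concrete, the following is a minimal sketch of self-training, one of the semi-supervised schemes listed in Chapter 1. It is an illustration only, not the dissertation's implementation: the nearest-centroid classifier, the confidence margin, and the sample points are all assumptions chosen to keep the example short.

```python
import math

def centroids(labeled):
    """Mean vector per class from (vector, label) pairs."""
    sums, counts = {}, {}
    for vec, label in labeled:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

def predict(cents, vec):
    """Return (label, margin): nearest centroid and its lead over the runner-up."""
    dists = sorted((math.dist(vec, c), label) for label, c in cents.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else math.inf
    return dists[0][1], margin

def self_train(labeled, unlabeled, per_round=1):
    """Repeatedly pseudo-label the most confident unlabeled points and retrain."""
    labeled, pool = list(labeled), list(unlabeled)
    while pool:
        cents = centroids(labeled)
        ranked = sorted(pool, key=lambda v: predict(cents, v)[1], reverse=True)
        for vec in ranked[:per_round]:
            labeled.append((vec, predict(cents, vec)[0]))
            pool.remove(vec)
    return centroids(labeled)

# Hypothetical data: two labeled points and four unlabeled ones.
labeled = [((0.0, 0.0), "A"), ((10.0, 10.0), "B")]
unlabeled = [(1.0, 0.5), (9.0, 9.5), (2.0, 1.5), (8.0, 8.5)]
model = self_train(labeled, unlabeled)
print(predict(model, (3.0, 3.0))[0])
```

Each round retrains on the enlarged labeled set, so the unlabeled points gradually shape the class centroids even though only two points were labeled by hand.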

a. Domestic literature review

b. International literature review

3. Research target

The general target of this study is to investigate the application of semi-supervised learning technology in Vietnamese text classification.


4. Research objects and scope

Research objects:

- Semi-supervised learning technology;

- Classification and clustering algorithms for structured and semi-structured data spaces;

- Vietnamese text classification.

5. Research content

- Determining a function or method that classifies data into classes efficiently (usually two classes);

- Predicting the classes of unlabeled data;

- Examining the impact of the amount of unlabeled data on the results of the algorithm;

- Developing testing software for Vietnamese text classification.

6. Research methodology

- Documentation methodology

- Empirical methodology

- Expert methodology

7. Main contributions of the dissertation

Main contributions of the dissertation include:

1. Proposing a new methodology for text classification based on the Geodesic model and graph theory.

2. Proposing a solution for reducing the dimensionality of text vectors based on a Dendrogram.

3. Building a data warehouse for Vietnamese text classification.

8. Dissertation structure

Main contents of the dissertation are presented in 4 chapters:

Chapter 1: Literature review

Chapter 2: Building a data warehouse

Chapter 3: Text classification based on Geodesic model

Chapter 4: Reducing the dimensionality of a vector based on Dendrogram

Chapter 1. LITERATURE REVIEW

1.1. Machine learning

1.1.1. Definition

1.1.2. Application of machine learning

1.2. Machine learning methodologies

1.2.1. Supervised learning

1.2.2. Unsupervised learning

1.2.3. Semi-supervised learning

1.2.4. Reinforcement learning

1.2.5. Deep learning

1.3. Overview of semi-supervised learning

1.3.1. Semi-supervised learning methodologies

- Expectation–maximization algorithm

- Transductive SVM

- Self-training algorithm

Figure 1.1 Maximum-margin hyperplane

Figure 1.2 Visual performance of Self-training setup

- Co-training algorithm

Figure 1.3 Visual performance of Co-training setup


1.3.2. SVM supervised learning algorithm and SVM semi-supervised learning algorithm

- Introduction

- Support vector machine (SVM) algorithm

Figure 1.4 Example of binary classification

1.3.3. SVM in text classification

1.3.4. Semi-supervised SVM and website classification

1.3.5. Typical text classification algorithm

1.4. Text classification

1.4.1. Text

1.4.2. Representing text as vectors

Figure 1.5 Model for representing text by feature vectors

1.4.3. Text classification


a. General model

Figure 1.6 General model of text classification system

b. Classification steps

1.5. Proposed research

The general model for text classification is presented in the figure below.

Figure 1.7 Text classification model

Figure 1.8 The proposed classification model

1.6. Conclusion

Chapter 2. BUILDING A DATA WAREHOUSE

2.1. Introduction of data warehouse for Vietnamese text classification

a. Introduction


b. Purpose of the data warehouse for Vietnamese text classification

2.2. Overview of the data warehouse

2.2.1. Definition of the data warehouse

2.2.2. Characteristics of the data warehouse

2.2.3. Purpose of the data warehouse

2.2.4. Data warehouse architectures

a. Basic data warehouse architecture

Figure 2.1 Architecture of a data warehouse

b. Data warehouse architecture with a staging area

Figure 2.2 Architecture of a data warehouse with a staging area

Components of the data warehouse:

- Data Sources

- Staging Area

- Metadata

- Data Warehouse

- Data Marts

2.3. Requirements Analysis

2.3.1. Data warehouse building

Table 2.1 Downloaded raw data

No.  Classification  Number of downloaded articles  Total size
1    Sport           1512                           363411 KB
2    Education       1231                           335561 KB
3    Law             1194                           175410 KB
4    International   1208                           255815 KB
5    Society         1152                           232633 KB

2.3.2. Data warehouse exploration

2.3.3. Data warehouse update

2.4. Data analysis and specification

2.5. Data warehouse building methodology

2.5.1. A proposed general model

Figure 2.3 The proposed general data warehouse model

2.5.2. Process of building a data warehouse

2.5.3. Process of text classification program

Step 1

Step 2

Step 3


Figure 2.4 Text classification process

a. Data preprocessing

b. Text representation

Vector space model

Figure 2.5 Vector model in 3D space

2.5.4. Text classification using Naïve Bayes algorithm

Table 2.2 Training data

Text    Confident  Creative  Ingenious  Enthusiasm  Class
Text 1  44         28        8          58          Sport
Text 2  12         31        40         4           Society
Text 4  35         42        10         47          Sport
Text 5  29         34        11         64          Sport
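As an illustration of how the Naïve Bayes algorithm can use training data such as Table 2.2, the sketch below trains a multinomial Naïve Bayes model on the four rows of the table. The multinomial variant, the Laplace smoothing constant `alpha`, and the two query vectors are assumptions made for this example; the summary does not state which variant the dissertation uses.

```python
import math

# Word counts per text, in the column order of Table 2.2:
# Confident, Creative, Ingenious, Enthusiasm.
TRAIN = [
    ([44, 28, 8, 58], "Sport"),
    ([12, 31, 40, 4], "Society"),
    ([35, 42, 10, 47], "Sport"),
    ([29, 34, 11, 64], "Sport"),
]

def train_nb(rows, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log word likelihoods per class."""
    totals, docs = {}, {}
    for counts, label in rows:
        acc = totals.setdefault(label, [0] * len(counts))
        for i, c in enumerate(counts):
            acc[i] += c
        docs[label] = docs.get(label, 0) + 1
    model = {}
    for label, acc in totals.items():
        denom = sum(acc) + alpha * len(acc)
        model[label] = (
            math.log(docs[label] / len(rows)),
            [math.log((c + alpha) / denom) for c in acc],
        )
    return model

def classify_nb(model, counts):
    """Return the class with the highest log posterior for a word-count vector."""
    def score(label):
        prior, loglik = model[label]
        return prior + sum(c * lw for c, lw in zip(counts, loglik))
    return max(model, key=score)

nb_model = train_nb(TRAIN)
# Hypothetical new texts, not taken from the dissertation.
print(classify_nb(nb_model, [30, 30, 9, 50]))
print(classify_nb(nb_model, [10, 25, 35, 5]))
```

Working in log space avoids numerical underflow when the word counts are large, which is typical for document vectors.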

2.5.5. Formatting the data outputs in data warehouses

a. Formatting the sample text


b. Example of the text’s format

2.6. Test result and evaluation of data warehouse

2.6.1. Test result of data warehouse

Table 2.3 Test result of data warehouse

No.  Topic          Number of articles
1    Sport          1023
2    Education      1014
3    Law            987
4    International  1009
5    Society        994

2.6.2. Data warehouse evaluation

2.7. Conclusion

Chapter 3. TEXT CLASSIFICATION BASED ON GEODESIC MODEL

3.1. Geodesic model based on support vector machines

3.1.1. Geodesic model

Figure 3.1 Illustrations of Euclidean and Geodesic distances

Figure 3.2. The proposed model


3.1.2. Geodesic distance-based manifold clustering technology

3.1.3. Geodesic distance calculation methodology

3.1.4. Kernel functions in Geodesic distance-based support vector machine

Support vector machines can use many kernel functions, such as:

- Polynomial function (homogeneous): $k(x_k, x_l) = (x_k \cdot x_l)^d$

- Polynomial function (inhomogeneous): $k(x_k, x_l) = (x_k \cdot x_l + 1)^d$

- Hyperbolic tangent function: $k(x_k, x_l) = \tanh(\beta\, x_k \cdot x_l + c)$, with $\beta > 0$ and $c < 0$

- Gaussian function: $k(x_k, x_l) = \exp(-\gamma \lVert x_k - x_l \rVert^2)$, with $\gamma > 0$

In this study, I propose a kernel function for the support vector machine that combines the Geodesic distance with the Gaussian function as follows:

$k(x_k, x_l) = \exp(-\gamma D_{kl})$

$k(x_k, x_l) = \exp(-\gamma D_k(x))$
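The kernel functions listed above can be written directly as short functions. The sketch below is illustrative: the default parameter values are arbitrary assumptions, and the geodesic variant simply takes a precomputed distance `d_kl` as input, since the summary does not show how the dissertation's program computes or stores it.

```python
import math

def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, d=2, inhomogeneous=False):
    """(x . y)^d, or (x . y + 1)^d for the inhomogeneous form."""
    return (dot(x, y) + (1.0 if inhomogeneous else 0.0)) ** d

def tanh_kernel(x, y, beta=0.5, c=-1.0):
    """tanh(beta * x . y + c), with beta > 0 and c < 0."""
    return math.tanh(beta * dot(x, y) + c)

def gaussian_kernel(x, y, gamma=0.5):
    """exp(-gamma * ||x - y||^2), with gamma > 0."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def geodesic_gaussian_kernel(d_kl, gamma=0.5):
    """exp(-gamma * D_kl), where D_kl is a precomputed geodesic distance."""
    return math.exp(-gamma * d_kl)

x, y = (1.0, 2.0), (3.0, 4.0)
print(polynomial_kernel(x, y), gaussian_kernel(x, y), geodesic_gaussian_kernel(2.0))
```

The only difference between the last two functions is the distance that enters the exponent: the squared Euclidean norm in the Gaussian kernel, and the geodesic distance in the proposed one.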

3.2. Text classification methodology based on Geodesic model

Proposed model:

Figure 3.3 Text classification model based on Geodesic distance

3.3. Testing text classification based on Geodesic model


3.3.1. Application development

3.3.2. Data preparation

Table 3.1 Number of files in the data warehouse

No.  Type of documents  Training (labelled)  Training (unlabelled)  Test  Total
1    Sport              10                   613                    400   1023
2    Education          10                   604                    400   1014
3    Law                10                   577                    400   987
4    International      10                   599                    400   1009
5    Society            10                   584                    400   994

3.3.3. Program deployment

- Training function

- Text classification function.

3.3.4. Results

a. The first experiment

Table 3.2 The first classification result with the use of the traditional SVM

Rows: actual label. Columns: label from classification results.

Actual label   Sport  Education  Law  International  Society  Accuracy
Sport          887    0          58   78             0        86.7%
Education      0      516        225  159            114      51.0%
Law            24     0          864  62             37       87.5%
International  0      64         16   895            34       88.7%
Society        0      108        277  253            356      35.8%

Average rate of successful classification: 69.9%

Table 3.3 The first classification result with the use of the proposed SVM

Actual label   Sport  Education  Law  International  Society  Accuracy
Sport          769    105        34   115            0        75.2%
Education      0      821        104  89             0        81.0%
Law            25     44         864  47             10       87.5%
International  17     23         21   932            16       92.4%
Society        74     67         172  326            356      35.7%

The average rate of successful classification on all topics is 69.9% with the traditional SVM and 74.4% with the proposed method.
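The per-topic accuracy and the average rate can be recomputed from a confusion matrix. The short sketch below does this for the counts in Table 3.2, taking each row as one actual label; the function and variable names are illustrative, not taken from the dissertation's program.

```python
LABELS = ["Sport", "Education", "Law", "International", "Society"]

# Rows are actual labels, columns are predicted labels (counts from Table 3.2).
TABLE_3_2 = [
    [887, 0, 58, 78, 0],
    [0, 516, 225, 159, 114],
    [24, 0, 864, 62, 37],
    [0, 64, 16, 895, 34],
    [0, 108, 277, 253, 356],
]

def per_class_accuracy(matrix):
    """Share of each actual class predicted correctly (diagonal / row sum)."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

rates = per_class_accuracy(TABLE_3_2)
for label, rate in zip(LABELS, rates):
    print(f"{label}: {rate:.1%}")
print(f"Average: {sum(rates) / len(rates):.1%}")
```

The average here is the unweighted mean of the five per-topic rates, which reproduces the 69.9% reported for the traditional SVM.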

b. The second experiment

Table 3.4 The second classification result with the use of the traditional SVM

Actual label   Sport  Education  Law  International  Society  Accuracy
Sport          868    63         34   0              58       84.8%
Education      0      888        43   0              83       87.6%
Law            0      35         878  6              68       89.0%
International  0      18         122  826            43       81.9%
Society        45     29         502  29             389      39.1%

Table 3.5 The second classification result with the use of the proposed SVM

Actual label   Sport  Education  Law  International  Society  Accuracy
Sport          808    0          0    184            31       79.0%
Education      0      676        0    279            59       66.7%
Law            0      0          593  276            118      60.1%
International  15     0          0    899            95       89.1%
Society        0      0          54   378            562      56.5%

c. The third experiment

Table 3.6 The third classification result with the use of the traditional SVM

Actual label   Sport  Education  Law  International  Society  Accuracy
Sport          721    0          7    295            0        70.5%
Education      0      763        0    234            17       75.2%
Law            0      22         674  291            0        68.3%
International  0      19         0    990            0        98.1%
Society        0      51         83   557            303      30.5%

Table 3.7 The third classification result with the use of the proposed SVM

Actual label   Sport  Education  Law  International  Society  Accuracy
Sport          750    0          126  147            0        73.3%
Education      0      879        117  18             0        86.7%
Law            0      81         804  41             23       85.1%
International  0      33         242  720            14       71.4%
Society        0      74         261  208            451      45.3%

d. The fourth experiment

Table 3.8 The fourth classification result with the use of the traditional SVM

Actual label   Sport  Education  Law  International  Society  Accuracy
Sport          759    25         22   217            0        74.2%
Education      14     737        71   179            13       72.7%
Law            0      48         689  181            69       69.8%
International  21     54         68   808            58       80.1%
Society        3      83         177  158            573      57.6%

Table 3.9 The fourth classification result with the use of the proposed SVM

Actual label   Sport  Education  Law  International  Society  Accuracy
Sport          834    25         28   136            0        81.5%
Education      14     778        31   179            12       76.7%
Law            0      50         689  178            70       69.8%
International  21     52         54   824            56       81.7%
Society        3      83         209  156            543      54.6%

e. The fifth experiment

Table 3.10 The fifth classification result with the use of the traditional SVM

Actual label   Sport  Education  Law  International  Society  Accuracy
Sport          776    34         19   194            0        75.9%
Education      14     725        75   179            21       71.5%
Law            0      46         692  184            65       70.1%
International  12     41         54   805            97       79.8%
Society        11     83         241  156            503      50.6%

Table 3.11 The fifth classification result with the use of the proposed SVM

Actual label   Sport  Education  Law  International  Society  Accuracy
Sport          736    26         43   218            0        71.9%
Education      0      799        121  42             52       78.8%
Law            17     35         795  98             42       80.5%
International  0      27         134  792            56       78.5%
Society        49     51         168  153            573      57.6%

Figure 3.4 The average value and the variance of the classification rate for the traditional SVM and the proposed method

The figure above shows the average value and the variance of the classification success rate for the traditional SVM and the proposed method.

3.4. Conclusion

In this chapter, the author presented the results of text classification based on the proposed model, which combines the Geodesic model with the support vector machine. The Geodesic model uses the shortest connection between texts (through their level of adjacency) to calculate the distance between two vectors. This Geodesic distance differs from the Euclidean distance; it helps to increase the accuracy of automatic text classification and allows classification into many classes instead of two (based on binary sub-classifiers).
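One common way to approximate a geodesic distance is to connect each point to its nearest neighbours and take shortest-path lengths along that graph. The sketch below follows this approach under stated assumptions: the k-nearest-neighbour graph, the Dijkstra search, and the sample points are illustrative choices, and the summary does not confirm that the dissertation computes the distance in exactly this way.

```python
import heapq
import math

def knn_graph(points, k):
    """Connect each point to its k nearest neighbours (symmetric, Euclidean weights)."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        for d, j in nearest:
            graph[i][j] = d
            graph[j][i] = d
    return graph

def geodesic_distances(graph, source):
    """Dijkstra shortest-path lengths from source along the graph edges."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale queue entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical points laid out along a C-shaped path: the two ends are close
# in a straight line but far apart when moving through neighbouring points.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2),
          (3, 3), (2, 3), (1, 3), (0, 3)]
geo = geodesic_distances(knn_graph(points, k=2), source=0)
print("Euclidean:", math.dist(points[0], points[-1]), "Geodesic:", geo[9])
```

In this example the Euclidean distance between the two ends is 3 while the path through adjacent points has length 9, which is the kind of difference between the two distances that the chapter relies on.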

Chapter 4. REDUCING THE DIMENSIONALITY OF A VECTOR BASED ON DENDROGRAM

This chapter presents the proposed solution for reducing the dimensionality of the vectors that represent Vietnamese text, based on a Dendrogram and documents taken from Wikipedia. The reduced vectors are then applied to Vietnamese text classification in experiments.

4.1. Introduction

4.1.1. Definition of Dendrogram

- Definition

Figure 4.1 Dendrogram

4.1.2. Proposed methodology

Figure 4.2 An example of a Dendrogram

4.2. Building Dendrogram from Wikipedia data


4.2.1. Wikipedia processing algorithm

Figure 4.3 Diagram of Wikipedia data processing algorithm

4.2.2. Dictionary processing algorithm

Figure 4.4 Diagram of dictionary processing algorithm

4.2.3. P matrix calculation algorithm for common appearing frequency

4.2.4. Algorithm for building Dendrogram

4.2.5. Cluster analysis

a. Wikipedia processing

b. Dictionary

c. Calculating the matrix of common appearing frequency

d. Data organizing in program


4.2.6. Experiment

4.2.6.1. System structure

4.2.6.2. Functions

a. Clustering function

Figure 4.5 Example of cutting the Dendrogram into three groups

b. Building classification model function

c. Classification function

4.2.6.3. Results

Clustering the dictionary gives the following results:

Figure 4.6 The number of pairs of words according to the common appearing frequency.

Figure 4.7 The number of groups based on clustering on Dendrogram

Cutting the dendrogram at 20% of the maximum distance gives sets of related words or synonyms as follows:

Figure 4.8 The result of clustering with the dendrogram

Figure 4.9 Another example shows words related to music.
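Cutting a dendrogram at a fraction of the maximum distance can be sketched as follows. This is a simplified illustration: single-linkage clustering, the word list, and the distance values are all assumptions made for the example, and the summary does not say which linkage rule the dissertation uses.

```python
def single_linkage_merges(dist):
    """Agglomerative clustering; returns merges as (height, cluster_a, cluster_b)."""
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        height, a, b = min(
            (
                (min(dist[i][j] for i in a for j in b), a, b)
                for idx, a in enumerate(clusters)
                for b in clusters[idx + 1:]
            ),
            key=lambda m: m[0],
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((height, a, b))
    return merges

def cut_dendrogram(n_items, merges, fraction):
    """Keep only the merges whose height is at most fraction * max height."""
    threshold = fraction * max(h for h, _, _ in merges)
    groups = [frozenset([i]) for i in range(n_items)]
    for height, a, b in merges:
        if height > threshold:
            break  # single-linkage heights never decrease, so stop here
        groups = [g for g in groups if g not in (a, b)] + [a | b]
    return groups

# Hypothetical words and pairwise distances (smaller means more related).
words = ["guitar", "piano", "doctor", "nurse", "goal"]
dist = [
    [0.0, 1.0, 9.0, 9.5, 8.0],
    [1.0, 0.0, 9.2, 9.1, 8.5],
    [9.0, 9.2, 0.0, 1.5, 10.0],
    [9.5, 9.1, 1.5, 0.0, 9.8],
    [8.0, 8.5, 10.0, 9.8, 0.0],
]
groups = cut_dendrogram(len(words), single_linkage_merges(dist), fraction=0.2)
print([sorted(words[i] for i in g) for g in groups])
```

With these values the cut at 20% keeps the two tightest merges, so the music words and the medicine words form two groups and the remaining word stays on its own.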


Figure 4.10 An example of Dendrogram about words

Figure 4.11 An example shows words related to medicine

4.3. Applying word clustering to text classification

4.3.1. Input data

4.3.2. Experiment results

a. Training model

Table 4.1 Training and testing data

                       Training
No.  Type of document  1st time  2nd time  3rd time  4th time  5th time  Testing
1    Sport             15        20        40        80        120       400
2    Education         15        20        40        80        120       400
3    Law               15        20        40        80        120       400
4    International     15        20        40        80        120       400
5    Society           15        20        40        80        120       400


Figure 4.12 The storage capacity of vectors depends on the number of words

Figure 4.13 Labeling time for the 5 training runs

b. Text classification

c. Accuracy of text classification

Figure 4.14 Average text classification time for the 5 training runs

Figure 4.15 Classification rates for the 5 training runs


d. The average accuracy of text classification

Figure 4.16 The change of results according to the classification rate

The figure above shows that reducing the dictionary can improve classification accuracy: with a suitable reduction rate (from 30% to 70% of the initial vector space), the text classification rate is higher than before the words were clustered and reduced.
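A minimal sketch of how word groups can shrink a text vector is given below. It assumes that the counts of words in the same group are summed into one dimension, which is one plausible reading of the reduction and is not confirmed by the summary; the vocabulary, groups, and counts are hypothetical.

```python
def reduce_vector(counts, vocabulary, groups):
    """Sum the counts of words that share a group; ungrouped words keep their own slot."""
    word_to_group = {w: name for name, members in groups.items() for w in members}
    reduced = {}
    for word, count in zip(vocabulary, counts):
        key = word_to_group.get(word, word)
        reduced[key] = reduced.get(key, 0) + count
    return reduced

# Hypothetical vocabulary and word groups obtained from a dendrogram cut.
vocabulary = ["guitar", "piano", "doctor", "nurse", "goal"]
groups = {"music": ["guitar", "piano"], "medicine": ["doctor", "nurse"]}
reduced = reduce_vector([3, 1, 0, 2, 5], vocabulary, groups)
print(len(vocabulary), "->", len(reduced), reduced)
```

Here a five-dimensional vector becomes a three-dimensional one, a 40% reduction, while the total word count is preserved.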

4.4. Conclusion

The proposed methodologies aim to enhance the quality of automatic Vietnamese text classification. The first methodology uses the Wikipedia encyclopedia and a Dendrogram to reduce the dimensionality of the vectors that represent Vietnamese text. The second methodology applies the reduced vectors to text classification. Experiments show that using the vector space reduced with the Dendrogram and Wikipedia not only saves storage capacity and time for Vietnamese text classification but also maintains classification accuracy; the classification rate is higher than without clustering.

The limitation of the proposed methodology is that clustering relies only on the common appearing frequency of pairs of words within a single Wikipedia page, which can lead to semantic errors if that page covers too many topics, for example a page that contains information about Sport, Law, Education… Future research will address this limitation.
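The common appearing frequency (co-occurrence) of word pairs per page can be counted as in the sketch below. It is illustrative only: the pages and dictionary are hypothetical, and whitespace splitting stands in for the Vietnamese word segmentation that real data would need.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(pages, dictionary):
    """Count, for each pair of dictionary words, the pages that contain both."""
    pair_counts = Counter()
    for text in pages:
        present = sorted(set(text.lower().split()) & dictionary)
        pair_counts.update(combinations(present, 2))
    return pair_counts

# Hypothetical pages; each string stands for the text of one page.
pages = [
    "guitar and piano concert",
    "piano lessons and guitar lessons",
    "doctor and nurse at the concert",
]
dictionary = {"guitar", "piano", "doctor", "nurse", "concert"}
counts = cooccurrence_counts(pages, dictionary)
print(counts[("guitar", "piano")], counts[("doctor", "nurse")])
```

Because the unit of counting is whatever string is passed in, the same function can be applied to paragraphs or sentences instead of whole pages, which is one way to limit the effect of pages that mix several topics.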

CONCLUSION

Achieved results

In this dissertation, the author presents research results in Vietnamese text classification that combine semi-supervised learning technology with the support vector machine (SVM).

The achieved results are as follows:

- Building a data warehouse for Vietnamese text classification.

- Proposing and testing a text classification methodology based on Geodesic distance.

- Proposing and testing a methodology for reducing the dimensionality of the vectors that represent Vietnamese text, increasing processing speed while preserving classification accuracy.

Based on these results, the dissertation compared the proposed Geodesic distance-based methodology with the traditional SVM model on the same data set. The average classification rates of the two methodologies are not significantly different; however, the variance of the proposed method (± 2%) is smaller than that of the traditional SVM (± 4%). This suggests that the proposed method is more reliable than the traditional SVM for Vietnamese text classification.

Experiments show that applying the vector space reduced by the Dendrogram and Wikipedia not only saves storage capacity and time for Vietnamese text classification but also maintains the correct classification rate in comparison with the unclustered case. At a reduction rate of 30% to 70% of the initial vector space, the correct classification rate is higher than without clustering.

Limitation of the dissertation

- The text classification program has largely completed the proposed functions, such as helping users build a classification model for Vietnamese texts and automatically classifying new texts based on the established model. However, the initial data collection is still at the experimental stage.

- The dissertation does not use WORDNET or a graph to consider the semantic correlation among words before building the feature vectors for text. This may reduce the quality of the clustering.

- The dimensionality reduction of text vectors relies only on the common appearing frequency of pairs of words within a single Wikipedia page to form word groups, so it can produce wrong meanings if the page covers too many topics, such as a page that includes information about Sport, Education, Law, International, Society…

- The dissertation has only been tested with the support vector machine (SVM).

- The dissertation has not yet compared different Dendrogram algorithms.

In future work, I will add several new functions and complete the program to enhance its effectiveness, and at the same time build a data warehouse large enough to classify text more accurately.

Proposal for future research

Nowadays, text summarization is a research trend that attracts many scientists, especially for Vietnamese, where many issues remain to be investigated. Text summarization therefore remains an open research topic. Given the limitations of the dissertation, I suggest the following directions for further research:

- Continuing research on WORDNET, which supports looking up English semantics, and from that building a WORDNET for Vietnamese; or using a graph to optimize the interaction ability when creating feature vectors for text.

- To enhance the effectiveness of the semi-supervised learning model combined with text summarization, continuing research on Vietnamese word segmentation methods in order to increase the accuracy of extracting the main ideas from text content, and at the same time running different content compression tests to find higher compression rates that improve the accuracy of text classification under the proposed model.

- Testing with the common appearing frequency within one paragraph or one sentence.

- Testing with another dataset apart from Wikipedia, for example, articles from Vietnamese online newspapers.

- Testing with other machine learning methodologies and comparing different Dendrogram algorithms.


LIST OF PUBLISHED SCIENTIFIC RESEARCH

1. Vo Duy Thanh, Vo Trung Hung, Pham Minh Tuan, Doan Van Ban, “Text classification based on semi-supervised learning”, Proceeding of the SoCPaR 2013, IEEE Catalog number CFP1395H-ART, ISBN 978-1-4799-3400-3/13/$31.00, pp. 238-242, 2013.

2. Vo Duy Thanh, Vo Trung Hung, Pham Minh Tuan and Ho Khac Hung, “Text Classification Based On Manifold Semi-Supervised Support Vector Machine”, Proceeding of the ISDA 2014, 14th International Conference on Intelligent Systems Design and Applications, Okinawa, Japan, 27-29 November 2014, IEEJ catalog, ISSN: 2150-7996, pp. 13-19.

3. Pham Minh Tuan, Nguyen Thi Le Quyen, Vo Duy Thanh, Vo Trung Hung, “Vietnamese Documents Classification Based on Dendrogram and Wikipedia”, Proceedings of Asian Conference on Information Systems 2014, ACIS 2014, December 1-3, 2014, Nha Trang, Viet Nam, © 2014 by ACIS 2014, ISBN: 978-4-88686-089-7, pp. 247-253.

4. Vo Duy Thanh, Vo Trung Hung, Ho Khac Hung, Tran Quoc Huy, “Text Classification Based On SVM And Text Summarization”, International Journal of Engineering Research & Technology (IJERT) ISSN: 2278-0181, Vol. 4, Issue 02, February-2015, pp. 181-186.

5. Vo Trung Hung, Nguyen Thi Ngoc Anh, Ho Phan Hieu, Nguyen Ngoc Huyen Tran, Vo Duy Thanh, “Comparison of the documents based on vector model”, In the Journal of Science and Technology, the University of Danang, ISSN: 1859-1531, No. 3(112)-2017, pp. 105-109.