Web Page Classification Approach

P.Kameshwari, E.Salini Varma

Citation:

P.Kameshwari, E.Salini Varma, "Web Page Classification Approach," International Journal of Computer Science and Engineering, vol. 4, no. 2, pp. 8-11, 2017. Crossref, https://doi.org/10.14445/23488387/IJCSE-V4I2P103

Abstract

Textual document classification is one of the interesting areas of data mining. Textual documents may be arranged according to their topics or to another characteristic (such as article type, writer, or year of publication); some articles consider only subject classification. Subject classification of documents is based on two main ideas: the content-based method and the request-based method. Web page classification is one kind of textual document classification. However, the text presented in a web page is not homogeneous, since a single page can discuss related but distinct subjects. Consequently, the results attained by a textual classifier on web pages are not as good as those it attains on plain textual documents, and innovative techniques are needed to improve them. One type of technique addresses this problem by using information hidden in the test set to correct the results assigned by a textual classifier. This article discusses a method that belongs to this category. The Cross Training based Corrective approach (CTC) is a new method for web page classification that acquires information from the test set in order to correct the classes initially assigned by a text classifier on that test set. This technique can be tested using three basic classification algorithms, Support Vector Machine (SVM), Naïve Bayes (NB) and K-Nearest Neighbors (KNN), on four subsets of the Open Directory Project (ODP).
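
The general idea of correcting initially assigned classes with information from the test set can be illustrated with a minimal sketch. It is not the CTC algorithm itself, whose details the abstract does not give. It assumes a bag-of-words representation, cosine similarity, and a KNN base classifier, and the names vectorize, cosine, knn_predict and classify_then_correct are hypothetical helpers introduced only for this illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query, labeled, k=3):
    """Majority label among the k labeled vectors most similar to the query."""
    ranked = sorted(labeled, key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def classify_then_correct(train, test_texts, k=3):
    """Illustrative two-step scheme (an assumption, not the authors' CTC).

    Step 1: a KNN classifier trained on the labeled pages assigns an
            initial class to every test page.
    Step 2: each test page is re-labeled from the initial classes of the
            other test pages, so the test set itself informs the result.
    """
    train_vecs = [(vectorize(text), label) for text, label in train]
    test_vecs = [vectorize(text) for text in test_texts]

    initial = [knn_predict(vec, train_vecs, k) for vec in test_vecs]

    corrected = []
    for i, vec in enumerate(test_vecs):
        others = [(test_vecs[j], initial[j])
                  for j in range(len(test_vecs)) if j != i]
        corrected.append(knn_predict(vec, others, k) if others else initial[i])
    return initial, corrected
```

Calling classify_then_correct(train, test_texts, k=1) with a list of (text, label) pairs and a list of unlabeled page texts returns both the initial and the corrected label lists, so the two can be compared. The SVM and Naïve Bayes classifiers named in the abstract could take the place of knn_predict in the first step.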

Keywords

Data mining, textual classifier, Cross Training based Corrective approach, Support Vector Machine (SVM), Naïve Bayes (NB), K-Nearest Neighbors (KNN)
