Extraction of Syntactically Similar Sentences from Huge Corpus for Language Research

International Journal of Computer Science and Engineering
© 2018 by SSRG - IJCSE Journal
Volume 5 Issue 8
Year of Publication : 2018
Authors : Sanjay Kumar, Sandhya Umrao

How to Cite?

Sanjay Kumar, Sandhya Umrao, "Extraction of Syntactically Similar Sentences from Huge Corpus for Language Research," SSRG International Journal of Computer Science and Engineering, vol. 5, no. 8, pp. 1-5, 2018. Crossref, https://doi.org/10.14445/23488387/IJCSE-V5I8P101


Corpus-based and statistical approaches exploit several heuristics to determine the summary-worthiness of sentences. They use the statistical occurrence of words, word pairs and noun phrases to calculate sentence weights and then extract the highest-scoring sentences. The purpose of this research is to build a tool for the extraction of syntactically similar sentences from a huge corpus for language research, and to discuss its design, use and implementation. The proposed tool is based on a logical approach to computational corpus linguistics, in which sentences of logic are used to express statements about texts and logical inference is used to manipulate these sentences in order to analyze the texts. The research holds that the functionalities needed in a corpus system can be implemented when they are based upon adequate means of representing, querying and reasoning. The proposed system implements hand coding, searching and parsing. Apart from being interesting from a practical point of view, the development of such a system raises intriguing philosophical and methodological questions: What is a corpus text? What is a corpus theory? What is the link between the truth of such a tool and its usefulness for natural language processing purposes? These and related questions are discussed in the research. The system exists in a prototype implementation, and the research contains numerous examples of this implementation in action.
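
The statistical weighting mentioned at the start of the abstract can be illustrated with a minimal sketch. It is not the authors' tool: it assumes a naive regular-expression sentence splitter and tokenizer, counts only words and adjacent word pairs (noun phrases are left out because they would need a tagger or parser), and normalizes each weight by sentence length. The function names `split_sentences`, `tokenize`, `score_sentences` and `top_sentences` are hypothetical and chosen for illustration only.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased alphabetic tokens only (assumption).
    return re.findall(r"[a-z]+", sentence.lower())


def score_sentences(text):
    """Return (weight, sentence) pairs based on word and word-pair frequencies."""
    sentences = split_sentences(text)
    token_lists = [tokenize(s) for s in sentences]

    # Corpus-wide counts of single words and of adjacent word pairs.
    word_freq = Counter(w for toks in token_lists for w in toks)
    pair_freq = Counter(p for toks in token_lists for p in zip(toks, toks[1:]))

    scored = []
    for sentence, toks in zip(sentences, token_lists):
        if not toks:
            continue
        weight = sum(word_freq[w] for w in toks)
        weight += sum(pair_freq[p] for p in zip(toks, toks[1:]))
        # Normalize by length so long sentences are not favored automatically.
        scored.append((weight / len(toks), sentence))
    return scored


def top_sentences(text, k=1):
    """Return the k highest-weighted sentences, highest first."""
    ranked = sorted(score_sentences(text), key=lambda item: item[0], reverse=True)
    return [sentence for _, sentence in ranked[:k]]


if __name__ == "__main__":
    sample = "The parser reads the corpus. The parser tags the corpus. Birds sing."
    for sentence in top_sentences(sample, k=2):
        print(sentence)
```

On the sample text, the two sentences that share frequent words and word pairs outrank the unrelated third sentence. A real system would likely add stop-word handling and noun-phrase counts, which this sketch omits.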


Corpus Linguistics, Corpus tools, Grammar, Grammar development, Logic programming.

