Vector representation of words for plagiarism detection based on string matching

Kensuke Baba, Tetsuya Nakatoh, Toshiro Minami

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    2 Citations (Scopus)

    Abstract

    Plagiarism detection in documents requires an appropriate definition of document similarity and an efficient way to compute it. This paper evaluates the validity of using vector representations of words to define document similarity, in terms of processing time and accuracy in plagiarism detection. It proposes a plagiarism detection algorithm based on the score vector weighted by vector representations of words. The score vector between two documents records, for every possible offset between the starting positions of the documents, the number of matches between corresponding words. The score vector and its weighted version can both be computed efficiently using convolutions. Two types of vector representation of words are evaluated with the proposed algorithm: randomly generated vectors, and a distributed representation generated from training data by a neural network-based method. The experimental results show that using the weighted score vector instead of the unweighted one reduces the processing time at the cost of a slight decrease in accuracy, and that the randomly generated representation suits the algorithm better than the distributed representation with respect to the tradeoff between processing time and accuracy.
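The score vector described in the abstract can be illustrated with a short Python sketch. This is not the authors' implementation: the function names, the use of NumPy FFTs, the choice of random ±1 word vectors, and the `dim` and `seed` parameters are all assumptions made for illustration. The exact version runs one convolution per shared word; the weighted version runs one convolution per vector dimension and returns an estimate of the match counts rather than exact values.

```python
import numpy as np

def cross_correlate(x, y):
    """Return c where c[d + len(x) - 1] = sum_i x[i] * y[i + d].

    Computed as an FFT-based convolution of reversed x with y.
    """
    size = len(x) + len(y) - 1
    fx = np.fft.rfft(x[::-1], size)
    fy = np.fft.rfft(y, size)
    return np.fft.irfft(fx * fy, size)

def score_vector(a, b):
    """Exact number of word matches for every offset between a and b.

    Index len(a) - 1 corresponds to offset 0 (both documents aligned
    at their first word). One convolution is run per shared word.
    """
    total = np.zeros(len(a) + len(b) - 1)
    for w in set(a) & set(b):
        x = np.array([1.0 if t == w else 0.0 for t in a])
        y = np.array([1.0 if t == w else 0.0 for t in b])
        total += cross_correlate(x, y)
    return np.rint(total).astype(int)

def weighted_score_vector(a, b, dim=256, seed=0):
    """Approximate score vector using random +/-1 word vectors.

    Assumption for this sketch: each word gets a random vector in
    {-1, +1}^dim, so equal words contribute dim to the inner product
    and different words contribute roughly 0 on average. Only dim
    convolutions are needed, regardless of vocabulary size.
    """
    rng = np.random.default_rng(seed)
    vocab = sorted(set(a) | set(b))
    emb = {w: rng.choice([-1.0, 1.0], size=dim) for w in vocab}
    A = np.array([emb[t] for t in a])  # shape (len(a), dim)
    B = np.array([emb[t] for t in b])  # shape (len(b), dim)
    total = np.zeros(len(a) + len(b) - 1)
    for k in range(dim):
        total += cross_correlate(A[:, k], B[:, k])
    return total / dim

doc_a = "the quick brown fox".split()
doc_b = "a quick brown dog".split()
exact = score_vector(doc_a, doc_b)
approx = weighted_score_vector(doc_a, doc_b)
print(exact.tolist())          # [0, 0, 0, 2, 0, 0, 0]
print(int(np.argmax(approx)))  # index of the best-matching offset
```

In this sketch, a high entry in either vector marks an offset at which the two documents share many aligned words, which is the signal a detector would threshold on. Lowering `dim` makes the weighted version faster but noisier, which mirrors the time and accuracy tradeoff the abstract reports.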

    Original language: English
    Title of host publication: Human Interface and the Management of Information
    Subtitle of host publication: Supporting Learning, Decision-Making and Collaboration - 19th International Conference, HCI International 2017, Proceedings
    Editors: Sakae Yamamoto
    Publisher: Springer Verlag
    Pages: 341-350
    Number of pages: 10
    ISBN (Print): 9783319585239
    Publication status: Published - 2017
    Event: Thematic track on Human Interface and the Management of Information, held as part of the 19th International Conference on Human–Computer Interaction, HCI International 2017 - Vancouver, Canada
    Duration: Jul 9 2017 to Jul 14 2017

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 10274 LNCS
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Other

    Other: Thematic track on Human Interface and the Management of Information, held as part of the 19th International Conference on Human–Computer Interaction, HCI International 2017
    Country/Territory: Canada
    City: Vancouver
    Period: 7/9/17 to 7/14/17

    All Science Journal Classification (ASJC) codes

    • Theoretical Computer Science
    • Computer Science (all)
