Vector representation of words for plagiarism detection based on string matching

Kensuke Baba, Tetsuya Nakatoh, Toshiro Minami

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    2 Citations (Scopus)

    Abstract

    Plagiarism detection in documents requires appropriate definition of document similarity and efficient computation of the similarity. This paper evaluates the validity of using vector representation of words for defining a document similarity in terms of the processing time and the accuracy in plagiarism detection. This paper proposes a plagiarism detection algorithm based on the score vector weighted by vector representation of words. The score vector between two documents represents the number of matches between corresponding words for every possible gap of the starting positions of the documents. The vector and its weighted version can be computed efficiently using convolutions. In this paper, two types of vector representation of words, that is, randomly generated vectors and a distributed representation generated by a neural network-based method from training data, are evaluated with the proposed algorithm. The experimental results show that using the weighted score vector instead of the normal one for the algorithm can reduce the processing time with a slight decrease of the accuracy, and that randomly generated vector representation is more suitable for the algorithm than the distributed representation in the sense of a tradeoff between the processing time and the accuracy.
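
As a rough illustration of the score vector described in the abstract, the following Python sketch computes the match count for every gap in two ways: a direct double loop, and a sum of one convolution per vector dimension. It is not the authors' implementation. The function names (`score_vector_naive`, `weighted_score_vector`, `one_hot_embedding`, `random_embedding`), the choice of random ±1/√k vectors, and the use of NumPy's `np.convolve` in place of an FFT-based convolution are all assumptions made for illustration; the abstract does not specify the paper's exact weighting scheme.

```python
import numpy as np


def score_vector_naive(a, b):
    """Count exact word matches for every gap between a and b, in O(m*n)."""
    m, n = len(a), len(b)
    scores = np.zeros(m + n - 1, dtype=int)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                # Index (j - i) + (m - 1) corresponds to the gap j - i.
                scores[j - i + m - 1] += 1
    return scores


def one_hot_embedding(vocab):
    """Hypothetical helper: one-hot vectors, which reproduce exact match counts."""
    eye = np.eye(len(vocab))
    return {w: eye[i] for i, w in enumerate(vocab)}


def random_embedding(vocab, k, seed=0):
    """Hypothetical helper: a random unit-norm vector (+-1/sqrt(k) entries) per word."""
    rng = np.random.default_rng(seed)
    return {w: rng.choice([-1.0, 1.0], size=k) / np.sqrt(k) for w in vocab}


def weighted_score_vector(a, b, embed):
    """Sum, over vector dimensions, of the convolution of b with reversed a.

    Entry (j - i) + (m - 1) accumulates the inner products of the vectors of
    a[i] and b[j], so one-hot vectors yield the exact score vector.
    """
    A = np.array([embed[w] for w in a], dtype=float)  # shape (m, k)
    B = np.array([embed[w] for w in b], dtype=float)  # shape (n, k)
    scores = np.zeros(len(a) + len(b) - 1)
    for d in range(A.shape[1]):
        scores += np.convolve(B[:, d], A[::-1, d])
    return scores


a = "the quick brown fox".split()
b = "a quick brown dog".split()
vocab = sorted(set(a) | set(b))
exact = score_vector_naive(a, b)
approx = weighted_score_vector(a, b, random_embedding(vocab, k=8))
```

With one-hot vectors the number of convolutions equals the vocabulary size, whereas with k-dimensional vectors it is k. Under the assumptions above, this is presumably where the reduction in processing time comes from, with the scores becoming approximate when different words have vectors that are not orthogonal.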

    Original language: English
    Title of host publication: Human Interface and the Management of Information
    Subtitle of host publication: Supporting Learning, Decision-Making and Collaboration - 19th International Conference, HCI International 2017, Proceedings
    Editor: Sakae Yamamoto
    Publisher: Springer Verlag
    Pages: 341-350
    Number of pages: 10
    ISBN (Print): 9783319585239
    DOI
    Publication status: Published - 2017
    Event: Thematic track on Human Interface and the Management of Information, held as part of the 19th International Conference on Human–Computer Interaction, HCI International 2017 - Vancouver, Canada
    Duration: Jul 9 2017 to Jul 14 2017

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    10274 LNCS
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Other

    Other: Thematic track on Human Interface and the Management of Information, held as part of the 19th International Conference on Human–Computer Interaction, HCI International 2017
    Country/Territory: Canada
    City: Vancouver
    Period: 7/9/17 to 7/14/17

    All Science Journal Classification (ASJC) codes

    • Theoretical Computer Science
    • General Computer Science
