Easily detect plagiarism in texts using pysimilar (python)

Mechatronics Engineer || Self-taught Python Developer || AI/ML Enthusiast || Tech nerd

Hi guys

I recently wrote an article titled How to detect plagiarism in text using python, in which I showed how you can detect plagiarism between documents manually using cosine similarity.

I republished that article on multiple platforms, including here on Hashnode and on Hackernoon. It is one of my most viewed articles, and its GitHub repository is the most starred among my article repositories.

That made me think about refactoring the code and the article to make them easier to get started with, even for absolute beginners. The result is a Python library, pysimilar, which simplifies the whole process as much as I could.

Getting started with Pysimilar

To get started with pysimilar for comparing text documents, you first need to install it, either with pip or directly from GitHub.

Here is how to install pysimilar using pip:

$ pip install pysimilar

Here is how to install directly from GitHub:

$ git clone https://github.com/Kalebu/pysimilar
$ cd pysimilar
$ python setup.py install

With pysimilar you can compare text documents either as strings or by specifying the paths to the files that contain the text.

Comparing strings directly

You can compare strings with pysimilar using the compare() function, as illustrated below:

>>> from pysimilar import compare
>>> compare('very light indeed', 'how fast is light')
0.17077611319011649
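To give a feel for what a cosine similarity score measures, here is a minimal sketch in plain Python. It is not pysimilar's implementation: the cosine_similarity function below is my own illustrative helper, and it uses raw word counts, so its score for the same two strings differs from the one pysimilar returned above, which may weight words differently.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts using raw word counts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words that appear in both texts.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    # Length (Euclidean norm) of each count vector.
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text has nothing in common with anything.
    return dot / (norm_a * norm_b)


score = cosine_similarity('very light indeed', 'how fast is light')
print(round(score, 4))  # 0.2887
```

The two strings share only the word "light", so the score is low. Identical texts score 1.0, and texts with no words in common score 0.0.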

Comparing strings contained in files

To compare the text contained in files, you just need to explicitly set the isfile parameter to True, as illustrated below:

>>> compare('README.md', 'LICENSE', isfile=True)
0.25545580376557886

You can also compare all documents with a particular extension in a given directory. For instance, say I want to compare all the .txt documents in a documents directory. Here is what I would do.

The directory used by the example below looks like this:

documents/
├── anomalie.zeta
├── hello.txt
├── hi.txt
└── welcome.txt

Here is how to compare files of a particular extension:

>>> import pysimilar
>>> from pprint import pprint
>>> pysimilar.extensions = '.txt'
>>> comparison_result = pysimilar.compare_documents('documents')
>>> pprint(comparison_result)
[['welcome.txt vs hi.txt', 0.6053485081062917],
 ['welcome.txt vs hello.txt', 0.0],
 ['hi.txt vs hello.txt', 0.0]]

You can also change the order in which the comparison results are sorted by score by setting the ascending parameter, as shown below:

>>> comparison_result = pysimilar.compare_documents('documents', ascending=True)
>>> pprint(comparison_result)
[['welcome.txt vs hello.txt', 0.0],
 ['hi.txt vs hello.txt', 0.0],
 ['welcome.txt vs hi.txt', 0.6053485081062917]]

You can also set pysimilar to include files with multiple extensions:

>>> import pysimilar
>>> from pprint import pprint
>>> pysimilar.extensions = ['.txt', '.zeta']
>>> comparison_result = pysimilar.compare_documents('documents', ascending=True)
>>> pprint(comparison_result)
[['welcome.txt vs hello.txt', 0.0],
 ['hi.txt vs hello.txt', 0.0],
 ['anomalie.zeta vs hi.txt', 0.4968161174826459],
 ['welcome.txt vs hi.txt', 0.6292275146695526],
 ['welcome.txt vs anomalie.zeta', 0.7895651507603823]]
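Once you have a result list like this, a natural next step is to keep only the pairs that look suspiciously similar. Here is a small sketch of that idea. The flag_suspicious helper and the 0.5 threshold are my own assumptions for illustration, not part of pysimilar, so pick a cutoff that makes sense for your documents.

```python
def flag_suspicious(results, threshold=0.5):
    """Keep the [label, score] rows whose score reaches the threshold."""
    return [row for row in results if row[1] >= threshold]


# Rows copied from the multi-extension output above.
results = [
    ['welcome.txt vs hello.txt', 0.0],
    ['hi.txt vs hello.txt', 0.0],
    ['anomalie.zeta vs hi.txt', 0.4968161174826459],
    ['welcome.txt vs hi.txt', 0.6292275146695526],
    ['welcome.txt vs anomalie.zeta', 0.7895651507603823],
]

for label, score in flag_suspicious(results):
    print(f'{label}: {score:.2f}')
# welcome.txt vs hi.txt: 0.63
# welcome.txt vs anomalie.zeta: 0.79
```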

Well, that's all for this article. I'm excited to see what you are going to build with it.

Here is a link to the GitHub repository: https://github.com/Kalebu/pysimilar
