Archive for the ‘Plagiarism’ Category

Pentagon Says: Facts Don’t Matter (Pre-Trump)

Thursday, November 17th, 2016

Intel chairman: Pentagon plagiarized Wikipedia in report to Congress by Kristina Wong.

From the post:

The Pentagon submitted information plagiarized from Wikipedia to members of Congress, the chairman of the House Intelligence Committee said at a hearing Thursday.

Chairman Devin Nunes (R-Calif.) said on March 21, Deputy Defense Secretary Bob Work submitted a document to the chairmen of the House Intelligence, Armed Services, and Defense appropriations committees with information directly copied from Wikipedia, an online open-source encyclopedia.

The information was submitted in a document used to justify a determination that Croughton was the best location for a joint intelligence center with the United Kingdom, Nunes said. The determination was required by the 2016 National Defense Authorization Act.

If that weren’t bad enough, here’s the kicker:

Work said he still fulfilled the law by making a determination and that the plagiarized information had “no bearing” on that determination.

Do you read that to mean:

  1. Work made the determination
  2. The “made” determination was packed with facts to justify it

In that order?

Remarkably candid admission that Pentagon decisions are made and then those decisions are packed with facts to justify them.

Not particularly surprising to me.


Four free online plagiarism checkers

Friday, November 20th, 2015

Four free online plagiarism checkers

From the post:

“Detecting duplicate content online has become so easy that spot-the-plagiarist is almost a party game,” former IJNet editor Nicole Martinelli wrote in 2012. “It’s no joke, however, for news organizations who discover they have published copycat content.”

When IJNet first ran Martinelli’s post, “Five free online plagiarism checkers,” two prominent U.S. journalists had recently been caught in the act: Fareed Zakaria and Jonah Lehrer.

Following acknowledgement that he had plagiarized sections of an article about gun control, Time and CNN suspended Zakaria. Lehrer first came under scrutiny for “self-plagiarism” at The New Yorker. Later, a journalist revealed Lehrer also fabricated or changed quotes attributed to Bob Dylan in his book, “Imagine.”

To date, Martinelli’s list of free plagiarism checkers has been one of IJNet’s most popular articles across all languages. It’s clear readers want to avoid the pitfalls of plagiarism, so we’ve updated the post with four of the best free online plagiarism checkers available to anyone, revised for 2015:

Great resource for checking your content and that of others for plagiarism.

The one caveat I offer is to not limit the use of text similarity software solely to plagiarism.

Text similarity can be a test for finding content that you would not otherwise discover. It depends on how high you set the threshold for “similarity.”

It may also find content that is so similar, while not plagiarism (say, multiple outlets writing from the same wire service), that it isn’t worth the effort to read every story that repeats the same story with minor edits.

Multiple stories but only one wire service source. In that sense, a “plagiarism” checker can enable you to skip duplicative content.
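A minimal sketch of that idea, assuming nothing beyond the Python standard library: break each story into overlapping three-word "shingles," compare the shingle sets with Jaccard similarity, and group stories that clear a threshold. The function names, the 0.5 threshold, and the sample stories are all invented for illustration; none of this comes from the checkers in the IJNet post.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_near_duplicates(stories, threshold=0.5):
    """Greedily group stories by similarity to each group's first member."""
    sets = [shingles(s) for s in stories]
    groups = []  # each group is a list of story indexes; index 0 is the representative
    for i, current in enumerate(sets):
        for group in groups:
            if jaccard(current, sets[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


# Hypothetical stories: two lightly edited copies of one wire report, one unrelated.
stories = [
    "The city council voted on Tuesday to approve the new transit budget after a long debate.",
    "The city council voted on Tuesday to approve the new transit budget after a lengthy debate.",
    "Local bakers gathered downtown for the annual bread festival this weekend.",
]
groups = group_near_duplicates(stories)
print(groups)
```

Read the first story in each group and skip the rest. Raising the threshold keeps more stories apart; lowering it starts surfacing loosely related content, which is the "discovery" use mentioned above.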

The post I quote above was published by the International Journalists’ Network (IJNet). Even if you aren’t a journalist, it is a great source to follow for developing news technology.

Senator John Walsh plagiarism, color-coded

Wednesday, July 30th, 2014

Senator John Walsh plagiarism, color-coded by Nathan Yau.

Nathan points to a New York Times visualization that makes a telling case for plagiarism against Senator John Walsh.

Best if you see it at Nathan’s site; his blog formats better than mine does.

Senator Walsh was rather obvious about it but I often wonder how much news copy, print or electronic, is really original?

Some is, I am sure, but when a story goes out over AP or UPI, how much of it is repeated verbatim in other outlets?

It’s not plagiarism because someone purchased a license to repeat the stories but it certainly isn’t original.

If an AP/UPI story is distributed and re-played in 500 news outlets, it remains one story. With no more credibility than it had at the outset.

Would color coding be as effective against faceless news sources as it has been against Sen. Walsh?
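For anyone who wants to try it on wire copy, here is a rough plain-text stand-in for color coding. It is a sketch only, assuming Python's standard difflib module; the mark_shared name, the four-word minimum, and the sample sentences are my inventions, and this is not how the New York Times built its graphic. Runs of words that also appear in the source get wrapped in double brackets.

```python
from difflib import SequenceMatcher


def mark_shared(text, source, min_words=4):
    """Wrap runs of at least min_words words shared with source in [[ ]]."""
    text_words = text.split()
    source_words = source.split()
    matcher = SequenceMatcher(
        None,
        [w.lower() for w in text_words],
        [w.lower() for w in source_words],
        autojunk=False,
    )
    out = []
    pos = 0
    for block in matcher.get_matching_blocks():
        if block.size < min_words:
            continue  # also skips the zero-length sentinel block at the end
        out.extend(text_words[pos:block.a])
        shared = " ".join(text_words[block.a:block.a + block.size])
        out.append("[[" + shared + "]]")
        pos = block.a + block.size
    out.extend(text_words[pos:])
    return " ".join(out)


# Hypothetical source and outlet copy.
source = "the quick brown fox jumps over the lazy dog near the river"
text = "Yesterday the quick brown fox jumps over a fence"
marked = mark_shared(text, source)
print(marked)
```

One limitation worth knowing: SequenceMatcher aligns the two texts in order, so passages that were copied and then rearranged may go unmarked.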

BTW, if you are interested in the sordid details: Pentagon Watchdog to review plagiarism probe of Sen. John Walsh. Incumbents need not worry: Sen. Walsh is an appointed senator and therefore an easy throw-away in order to look tough on corruption.

A Proposed Taxonomy of Plagiarism

Thursday, November 7th, 2013

A Proposed Taxonomy of Plagiarism Or, what we talk about when we talk about plagiarism by Rick Webb.

From the post:

What with the recent Rand Paul plagiarism scandal, I’d like to propose a new taxonomy of plagiarism. Some plagiarism is worse than others, and the basic definition of plagiarism that most people learned in school is only part of it.

Chris Hayes started off his show today by referencing the Wikipedia definition of plagiarism: “the ‘wrongful appropriation’ and ‘purloining and publication’ of another author’s ‘language, thoughts, ideas, or expressions,’ and the representation of them as one’s own original work.” The important point here that most people overlook is the theft of ideas. We all learn in school that plagiarism exists if we wholesale copy and paste other people’s words. But ideas are actually a big part of it.

Interesting read but I am not sure the taxonomy is fine grained enough.

Topic maps, like any other publication, have the potential for plagiarism. But I would make plagiarism distinctions for topic map content based upon its intended audience.

For example, if I were writing a topic map about topic maps, there would be a lot of terms and subjects which I would use, relying on the background of the audience to know they did not originate with me.

But when I moved into the first instance of an idea being proposed, etc., then I should be using more formal citation because that enables the reader to track the development of a particular idea or strategy. It would be inappropriate to talk about tolog, for example, without crediting Lars Marius Garshol with its creation and clearly distinguishing any statements about tolog as being from particular sources.

All topic map followers already know those facts but in formal writing, you should help the reader with tracking down the sources you relied upon.

A committee discussion of tolog is a completely different case: no one is going to footnote their comments and, hopefully, if you are participating in a discussion of tolog, you are aware of its origins.

On the Rand Paul “scandal,” I think the media reaction cheapens the notion of plagiarism.

A better response to Rand Paul (you pick the topic) would be:

[Senator Paul], what you’ve just said is one of the most insanely idiotic things I have ever heard. At no point in your rambling, incoherent response were you even close to anything that could be considered a rational thought. Everyone in this room is now dumber for having listened to it. I award you no points, and may God have mercy on your soul. (Billy Madison)

A new slogan for CNN (original): CNN: Spreading Dumbness 24X7.

Inter-Document Similarity with Scikit-Learn and NLTK

Saturday, May 4th, 2013

Inter-Document Similarity with Scikit-Learn and NLTK by Sujit Pal.

From the post:

Someone recently asked me about using Python to calculate document similarity across text documents. The application had to do with cheating detection, ie, compare student transcripts and flag documents with (abnormally) high similarity for further investigation. For security reasons, I could not get access to actual student transcripts. But the basic idea was to convince ourselves that this approach is valid, and come up with a code template for doing this.

I have been playing quite a bit with NLTK lately, but for this work, I decided to use the Python ML Toolkit Scikit-Learn, which has pretty powerful text processing facilities. I did end up using NLTK for its cosine similarity function, but that was about it.

I decided to use the coffee-sugar-cocoa mini-corpus of 53 documents to test out the code – I first found this in Dr Manu Konchady’s TextMine project, and I have used it off and on. For convenience I have made it available at the github location for the sub-project.

Similarity measures are fairly well understood.

But they lack interesting data sets for testing code.

Here are some random suggestions:

  • Speeches by Republicans on Benghazi
  • Speeches by Democrats on Gun Control
  • TV reports on any particular disaster
  • News reports of sporting events
  • Dialogue from popular TV shows

With a five to ten second lag, perhaps streams of speech could be monitored for plagiarism or repetition and simply dropped.
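If you want to try a similarity measure on any of those data sets, here is a minimal sketch of TF-IDF weighting with cosine similarity, written against the Python standard library only. It is not Sujit Pal's code (his post uses Scikit-Learn and NLTK); the function names, the smoothed IDF formula, the 0.8 threshold, and the three sample sentences are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption of this sketch): terms found in every
    # document keep a small positive weight instead of dropping to zero.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in doc_freq.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_similar(docs, threshold=0.8):
    """Return (i, j, score) for every document pair at or above threshold."""
    vectors = tfidf_vectors(docs)
    flagged = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            score = cosine(vectors[i], vectors[j])
            if score >= threshold:
                flagged.append((i, j, score))
    return flagged


# Hypothetical mini-corpus.
docs = [
    "coffee prices rose sharply in Brazil",
    "coffee prices rose sharply in Brazil today",
    "sugar exports fell in Cuba",
]
flagged = flag_similar(docs)
print([(i, j) for i, j, _ in flagged])
```

The pairwise loop is quadratic in the number of documents, which is fine for a 53-document corpus but would need an index or blocking step for anything like a live stream of speech.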


LobbyPlag: compares text of EU regulation with texts of lobbyists’ proposals

Wednesday, February 13th, 2013

LobbyPlag: compares text of EU regulation with texts of lobbyists’ proposals

From the post:

A service called LobbyPlag lets users view provisions of EU regulations and compare them to provisions of lobbyists’ proposals.

The example currently available on LobbyPlag concerns the General Data Protection Regulation (GDPR).

Click here to see how LobbyPlag compares the GDPR’s forum shopping provision to what the site claims are lobbyists’ proposals for that provision.

LobbyPlag is an interesting use of legal text comparison tools to promote transparency.

See the original post for more details and links.

Another step in the right direction.
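The comparison itself needs very little machinery. As a sketch only, assuming Python's standard difflib module: score a provision against each candidate proposal and report the closest one. I have no knowledge of how LobbyPlag does its matching, and the provision and proposal texts below are invented, not taken from the GDPR or from any lobbyist's submission.

```python
from difflib import SequenceMatcher


def best_match(provision, proposals):
    """Return (index, ratio) of the proposal most similar to the provision."""
    scored = [
        (SequenceMatcher(None, provision.lower(), p.lower(), autojunk=False).ratio(), i)
        for i, p in enumerate(proposals)
    ]
    ratio, index = max(scored)
    return index, ratio


# Invented texts, for illustration only.
provision = (
    "The controller shall notify the supervisory authority of a data breach "
    "without undue delay."
)
proposals = [
    "Member states may set their own rules on retention periods.",
    "The controller shall notify the supervisory authority of a data breach "
    "without unreasonable delay.",
]
index, ratio = best_match(provision, proposals)
print(index, round(ratio, 2))
```

A high ratio says the wording is close, nothing more. Who copied from whom, or whether both drew on a common draft, still takes a human reader.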

Data deduplication tactics with HDFS and MapReduce [Contractor Plagiarism?]

Wednesday, February 13th, 2013

Data deduplication tactics with HDFS and MapReduce

From the post:

As the amount of data continues to grow exponentially, there has been increased focus on stored data reduction methods. Data compression, single instance store and data deduplication are among the common techniques employed for stored data reduction.

Deduplication often refers to elimination of redundant subfiles (also known as chunks, blocks, or extents). Unlike compression, data is not changed and eliminates storage capacity for identical data. Data deduplication offers significant advantage in terms of reduction in storage, network bandwidth and promises increased scalability.

From a simplistic use case perspective, we can see application in removing duplicates in Call Detail Record (CDR) for a Telecom carrier. Similarly, we may apply the technique to optimize on network traffic carrying the same data packets.

Covers five (5) tactics:

  1. Using HDFS and MapReduce only
  2. Using HDFS and HBase
  3. Using HDFS, MapReduce and a Storage Controller
  4. Using Streaming, HDFS and MapReduce
  5. Using MapReduce with Blocking techniques
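The idea common to all five is to fingerprint each chunk and store every distinct chunk only once. Here is a minimal in-memory sketch of that idea, assuming fixed-size chunks and SHA-256 digests. It does not use HDFS, HBase, or MapReduce, and the ChunkStore class and its methods are hypothetical names of mine, not anything from the quoted post.

```python
import hashlib


class ChunkStore:
    """Single-instance store: each distinct chunk is kept once, keyed by digest."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # digest -> chunk bytes
        self.files = {}   # file name -> ordered list of digests

    def put(self, name, data):
        """Split data into fixed-size chunks and store only unseen chunks."""
        digests = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        """Reassemble a file from its chunk digests."""
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        """Bytes actually held, after deduplication."""
        return sum(len(c) for c in self.chunks.values())


# Two hypothetical documents that share their first two chunks.
store = ChunkStore(chunk_size=4)
store.put("report_v1", b"AAAABBBBCCCC")
store.put("report_v2", b"AAAABBBBDDDD")
print(store.stored_bytes())
```

Fixed-size chunks are the simplest choice but a fragile one: inserting a single byte near the start of a file shifts every later chunk boundary, so nothing after it matches.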

In these times of “Great Sequestration,” how much are you spending on duplicated contractor documentation?

You do get electronic forms of documentation. Yes?

Not that difficult to document prior contractor self-plagiarism. Teasing out what you “mistakenly” paid for it may be harder.

Question: Would you rather find out now and correct or have someone else find out?

PS: For the ambitious in government employment. You might want to consider how discovery of contractor self-plagiarism reflects on your initiative and dedication to “good” government.

PlagSpotter [Ghost of Topic Map Past?]

Monday, December 10th, 2012

I found a link to PlagSpotter in the morning mail.

I found it quite responsive, although I thought the “Share and Help Your Friends Protect Their Web Content” rather limiting.

Here’s why:

To test the timeliness of PlagSpotter, I chose a blog entry from another blog, one I quoted late yesterday.

And it worked!

While looking at the results, I saw people I expected to quote the same post, but then noticed there were people unknown to me on the list.

Rather than detecting plagiarism, the first off-label use of PlagSpotter is to identify communities quoting the same content.

With just a little more effort, the second off-label use of PlagSpotter is to track the spread of content across a community, by time. (With a little post processing, location, language as well.)

A third off-label use of PlagSpotter is to generate a list of sources that use the same content, a great seed list for a private search engine for a particular area/community.
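The second and third uses amount to the post-processing mentioned above. Here is a sketch, assuming you already have a list of sightings as (source, ISO timestamp, quoted text) tuples. That input format, the function names, and the example.com-style sources are hypothetical; I am not describing PlagSpotter's actual output or API.

```python
import hashlib
import re
from collections import defaultdict
from datetime import datetime


def fingerprint(text):
    """Hash the text after lowercasing and stripping punctuation and extra spaces."""
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def spread_timeline(sightings):
    """Group sightings by content fingerprint, each group sorted by time."""
    timeline = defaultdict(list)
    for source, when, text in sightings:
        timeline[fingerprint(text)].append((datetime.fromisoformat(when), source))
    return {fp: sorted(hits) for fp, hits in timeline.items()}


def seed_list(sightings):
    """Return the sorted set of sources seen, a seed list for a focused crawl."""
    return sorted({source for source, _, _ in sightings})


# Hypothetical sightings of quoted content.
sightings = [
    ("blog-b.example", "2012-12-10T08:30:00", "Topic maps  merge subjects!"),
    ("blog-a.example", "2012-12-09T21:00:00", "topic maps merge subjects"),
    ("blog-c.example", "2012-12-10T09:00:00", "Something else entirely"),
]
timeline = spread_timeline(sightings)
for hits in timeline.values():
    print([source for _, source in hits])
```

Exact fingerprints only catch copies that differ in case, punctuation, or spacing. Lightly edited copies would need a similarity measure rather than a hash.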

The earliest identifiable discussion of topic maps as topic maps involved detection of duplicated content (with duplicated charges for that content) for documentation in government contracts.

Perhaps that is why topic maps never gained much traction in government contracting. Cheats dislike being identified as cheats.

Ah, a fourth off-label use of PlagSpotter, detecting duplicated documentation submitted as part of weapon system or other documentation.

I find all four off-label uses of PlagSpotter more persuasive than protecting content.

Content only has value when other people use it, hopefully with attribution.