Technology is rapidly evolving, and with it comes increasingly sophisticated bots (i.e. software robots) which automatically produce content to inform, influence, and deceive genuine users. This is particularly a problem for social media networks where content tends to be extremely short, informally written, and full of inconsistencies. Motivated by the rise of bots on these networks, we investigate the ease with which a bot can deceive a human. In particular, we focus on deceiving a human into believing that an automatically generated sample of text was written by a human, as well as analysing which factors affect how convincing the text is. To accomplish this, we train a set of models to write text about several distinct topics, to simulate a bot’s behaviour, which are then evaluated by a panel of judges. We find that: (1) typical Internet users are twice as likely to be deceived by automated content than security researchers; (2) text that disagrees with the crowd’s opinion is more believably human; (3) light-hearted topics such as Entertainment are significantly easier to deceive with than factual topics such as Science; and (4) automated text on Adult content is the most deceptive regardless of a user’s background.
The statistics presented are impressive:
We found that automated text is twice as likely to deceive Internet users than security researchers. Also, text that disagrees with the Crowd’s opinion increases the likelihood of deception by up to 78%, while text on light-hearted Topics such as Entertainment increases the likelihood by up to 85%. Notably, we found that automated text on Adult content is the most deceptive for both typical Internet users and security researchers, increasing the likelihood of deception by at least 30% compared to other Topics on average. Together, this shows that it is feasible for a party with technical resources and knowledge to create an environment populated by bots that could successfully deceive users.
… (at page 1120)
To evaluate those statistics, consider the judging panels that produced the supporting data:
To evaluate this test dataset, a panel of judges is used where every judge receives the entire test set with no other accompanying data such as Topic and Crowd opinion. Then, each judge evaluates the comments based solely on their text and labels each as either human or bot, depending who they believe wrote it. To fill this panel, three judges were selected – in keeping with the average procedure of the work highlighted by Bailey et al.  – for two distinct groups:
- Group 1: Three cyber security researchers who are actively involved in security work with an intimate knowledge of the Internet and its threats.
- Group 2: Three typical Internet users who browse social media daily but are not experienced with technology or security, and therefore less aware of the threats.
… (pages 1117-1118)
The paper reports human-versus-machine evaluations of generated text, across topics, by six (6) people.
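To see why six judges is so thin, note that the paper's headline comparison (typical users versus security researchers) rests on two groups of three judges each. A quick sketch of 95% Wilson score intervals, using hypothetical counts (the specific numbers below are illustrative, not from the paper), shows how wide the uncertainty is at n=3:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Hypothetical illustration: on a given comment, suppose 2 of the 3
# typical Internet users were deceived but only 1 of the 3 researchers.
print(wilson_interval(2, 3))  # roughly (0.21, 0.94)
print(wilson_interval(1, 3))  # roughly (0.06, 0.79)
```

The two intervals overlap almost entirely, so a "twice as likely" ratio between groups of this size carries enormous sampling uncertainty unless it is aggregated over many comments and judges.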
I’m suddenly less impressed than the abstract led me to hope.
A more informative title would have been: 6 People Classify Machine/Human Generated Reddit Comments.
To their credit, the authors were explicit about the judging panels in their study.
I am forced to conclude that either peer review wasn’t used for SAC 2016, the 31st ACM Symposium on Applied Computing, or its peer review left a great deal to be desired.
As a conference-goer, would you be interested in human/machine judgments from six unknown panelists?