Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

June 7, 2018

Are AI Psychopaths Cost Effective?

Filed under: Artificial Intelligence,Machine Learning,Reddit — Patrick Durusau @ 3:34 pm

Norman, World’s first psychopath AI

From the webpage:


We present you Norman, world’s first psychopath AI. Norman is born from the fact that the data that is used to teach a machine learning algorithm can significantly influence its behavior. So when people talk about AI algorithms being biased and unfair, the culprit is often not the algorithm itself, but the biased data that was fed to it. The same method can see very different things in an image, even sick things, if trained on the wrong (or, the right!) data set. Norman suffered from extended exposure to the darkest corners of Reddit, and represents a case study on the dangers of Artificial Intelligence gone wrong when biased data is used in machine learning algorithms.

Norman is an AI that is trained to perform image captioning; a popular deep learning method of generating a textual description of an image. We trained Norman on image captions from an infamous subreddit (the name is redacted due to its graphic content) that is dedicated to document and observe the disturbing reality of death. Then, we compared Norman’s responses with a standard image captioning neural network (trained on MSCOCO dataset) on Rorschach inkblots; a test that is used to detect underlying thought disorders.

Note: Due to the ethical concerns, we only introduced bias in terms of image captions from the subreddit which are later matched with randomly generated inkblots (therefore, no image of a real person dying was utilized in this experiment).
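The comparison the team describes is easy to picture in code: run the same inkblot through two captioning models that differ only in their training data and compare the text. Below is a minimal sketch of that setup; the publicly available captioning model stands in for the “standard” network and the biased checkpoint path is purely hypothetical, since this is not the MIT team’s actual code or models.

```python
# Sketch: caption the same (inkblot-like) image with two differently trained
# models and compare the output. Model names/paths are illustrative
# placeholders, not the models used by the Norman project.
from transformers import pipeline

# A public captioning model stands in for the "standard" MSCOCO-style network;
# the second path is a hypothetical biased checkpoint standing in for Norman.
standard = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
biased = pipeline("image-to-text", model="path/to/hypothetical-biased-checkpoint")

image_path = "inkblot_01.png"  # a randomly generated inkblot image

for name, captioner in [("standard", standard), ("biased", biased)]:
    caption = captioner(image_path)[0]["generated_text"]
    print(f"{name}: {caption}")
```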

I have written to the authors to ask for more details about their training process; the phrase “…which are later matched with randomly generated inkblots…” seems especially opaque to me.

While waiting for that answer, we should ask: is training psychopath AIs cost effective?

Compare the limited MIT-Norman with the PeopleFuckingDying subreddit and its 690,761 readers.

That’s a single subreddit. Many subreddits count psychopaths among their members. To say nothing of Twitter trolls and other social media where psychopaths gather.

New psychopaths appear on social media every day, without the ethics-limited training provided to MIT-Norman. Is this really a cost-effective approach to developing psychopaths?

The MIT-Norman project has great graphics, but as Hitchcock demonstrated over and over again, simple scenes can be packed with heart-pounding terror.

May 5, 2016

Mentioning Nazis or Hitler

Filed under: Natural Language Processing,Reddit — Patrick Durusau @ 9:51 am

78% of Reddit Threads With 1,000+ Comments Mention Nazis

From the post:

Let me start this post by noting that I will not attempt to test Godwin’s Law, which states that:

As an online discussion grows longer, the probability of a comparison involving Nazis or Hitler approaches 1.

In this post, I’ll only try to find out how many Reddit comments mention Nazis or Hitler and ignore the context in which they are made. The data source for this analysis is the Reddit dataset which is publicly available on Google BigQuery. The following graph is based on 4.6 million comments and shows the share of comments mentioning Nazis or Hitler by subreddit.

Left for a later post:

The next step would be to implement sophisticated text mining techniques to identify comments which use Nazi analogies in a way as described by Godwin. Unfortunately due to time constraints and the complexity of this problem, I was not able to try for this blog post.
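The simple count itself, before any of that context analysis, is a one-query job against the public Reddit dataset on BigQuery. A minimal sketch follows; the monthly table name and the comment-volume cutoff are my assumptions, not details taken from the post.

```python
# Sketch: share of comments mentioning "nazi" or "hitler" per subreddit,
# run against the public Reddit comments dataset on BigQuery.
# The table name (fh-bigquery.reddit_comments.2015_05) and the 10,000-comment
# cutoff are assumptions; adjust to whatever tables you query.
from google.cloud import bigquery

client = bigquery.Client()  # needs GOOGLE_APPLICATION_CREDENTIALS configured

sql = """
SELECT
  subreddit,
  COUNT(*) AS total_comments,
  COUNTIF(REGEXP_CONTAINS(LOWER(body), r'nazi|hitler')) AS matching_comments,
  COUNTIF(REGEXP_CONTAINS(LOWER(body), r'nazi|hitler')) / COUNT(*) AS share
FROM `fh-bigquery.reddit_comments.2015_05`
GROUP BY subreddit
HAVING COUNT(*) > 10000
ORDER BY share DESC
LIMIT 20
"""

for row in client.query(sql).result():
    print(row.subreddit, row.total_comments, round(row.share, 4))
```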

Since Godwin’s law applies to inappropriate invocations of Nazis or Hitler, that implies there are legitimate uses of those terms.

What captures my curiosity is what characteristics must a subject have to be a legitimate comparison to Nazis and/or Hitler?

Or more broadly, what characteristics must a subject have to be classified as a genocidal ideology or a person who advocates genocide?

My thinking is that it isn’t Nazism (historically speaking) that needs to be avoided, but the more general impulse that leads to genocidal rhetoric and policies.

February 29, 2016

Scraping Reddit

Filed under: Reddit — Patrick Durusau @ 8:02 pm

Scraping Reddit by Daniel Donohue.

From the post:

For our third project here at NYC Data Science, we were tasked with writing a web scraping script in Python. Since I spend (probably too much) time on Reddit, I decided that it would be the basis for my project. For the uninitiated, Reddit is a content-aggregator, where users submit text posts or links to thematic subforums (called “subreddits”), and other users vote them up or down and comment on them. With over 36 million registered users and nearly a million subreddits, there is a lot of content to scrape.

Daniel walks through his scraping and display of the resulting data.
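His code isn’t reproduced here, but the core of such a scraper is short. A minimal sketch with PRAW, the Python Reddit API Wrapper (credentials and the subreddit name are placeholders, and the interface shown is current PRAW rather than the 2016-era version Daniel would have used):

```python
# Sketch: pull recent submissions from a subreddit with PRAW.
# Credentials and the subreddit name are placeholders.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="scraping-demo by u/your_username",
)

for submission in reddit.subreddit("datascience").hot(limit=25):
    print(submission.score, submission.num_comments, submission.title)
```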

In case you are short on encrypted core dumps, you can fill up a stack of DVDs with randomized and encrypted Reddit posts. Just something to leave for unexpected visitors to find.

Be sure to use a Sharpie to write Arabic letters on some of the DVDs.

Who knows? Someday your post to Reddit, in its encrypted form, may serve to confound and confuse the FBI.

February 15, 2016

Automating Family/Party Feud

Filed under: Natural Language Processing,Reddit,Vectors — Patrick Durusau @ 11:19 am

Semantic Analysis of the Reddit Hivemind

From the webpage:

Our neural network read every comment posted to Reddit in 2015, and built a semantic map using word2vec and spaCy.

Try searching for a phrase that’s more than the sum of its parts to see what the model thinks it means. Try your favourite band, slang words, technical things, or something totally random.

In a tweet, Lynn Cherny suggested searching for “actually.”

If you are interested in the background on this tool, see: Sense2vec with spaCy and Gensim by Matthew Honnibal.

From the post:

If you were doing text analytics in 2015, you were probably using word2vec. Sense2vec (Trask et al., 2015) is a new twist on word2vec that lets you learn more interesting, detailed and context-sensitive word vectors. This post motivates the idea, explains our implementation, and comes with an interactive demo that we’ve found surprisingly addictive.

Polysemy: the problem with word2vec

When humans write dictionaries and thesauruses, we define concepts in relation to other concepts. For automatic natural language processing, it’s often more effective to use dictionaries that define concepts in terms of their usage statistics. The word2vec family of models are the most popular way of creating these dictionaries. Given a large sample of text, word2vec gives you a dictionary where each definition is just a row of, say, 300 floating-point numbers. To find out whether two entries in the dictionary are similar, you ask how similar their definitions are – a well-defined mathematical operation.
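To make the “row of floating-point numbers” picture concrete, here is a minimal gensim word2vec sketch on a toy corpus. It is not the Reddit 2015 corpus or the sense2vec/spaCy pipeline the post describes, just the underlying idea: each entry is a vector, and similarity between entries is cosine similarity over those vectors.

```python
# Sketch: word2vec "definitions" are vectors; similarity between two entries
# is a well-defined mathematical operation (cosine similarity).
# Toy corpus for illustration only.
from gensim.models import Word2Vec

sentences = [
    ["reddit", "comments", "are", "a", "large", "corpus"],
    ["word2vec", "learns", "vectors", "from", "usage", "statistics"],
    ["similar", "words", "get", "similar", "vectors"],
]

model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, epochs=50)

print(model.wv["word2vec"][:5])                 # first dimensions of one "definition"
print(model.wv.similarity("words", "vectors"))  # cosine similarity of two entries
```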

Certain to be a hit at technical conferences and parties.

SGML wasn’t mentioned even once during 2015 in Reddit Comments.

Try some of your favorite words and phrases.

Enjoy!

December 7, 2015

Jupyter on Apache Spark [Holiday Game]

Filed under: Python,Reddit,Spark — Patrick Durusau @ 4:46 pm

Using Jupyter on Apache Spark: Step-by-Step with a Terabyte of Reddit Data by Austin Ouyang.

From the post:

The DevOps series covers how to get started with the leading open source distributed technologies. In this tutorial, we step through how to install Jupyter on your Spark cluster and use PySpark for some ad hoc analysis of reddit comment data on Amazon S3.

The following tutorial installs Jupyter on your Spark cluster in standalone mode on top of Hadoop and also walks through some transformations and queries on the reddit comment data on Amazon S3. We assume you already have an AWS EC2 cluster up with Spark 1.4.1 and Hadoop 2.7 installed. If not, you can go to our previous post on how to quickly deploy your own Spark cluster.
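The tutorial itself targets Spark 1.4.1 (the SQLContext era); a minimal sketch of the same kind of ad hoc query in current PySpark looks like the following, with a placeholder S3 path rather than the tutorial’s actual bucket.

```python
# Sketch: ad hoc analysis of Reddit comment JSON on S3 with PySpark.
# Uses the modern SparkSession API (the tutorial's Spark 1.4.1 used SQLContext);
# the S3 path is a placeholder.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("reddit-comments").getOrCreate()

comments = spark.read.json("s3a://your-bucket/reddit/comments/*.json")

# Top subreddits by comment volume, with average comment score
(comments
    .groupBy("subreddit")
    .agg(F.count("*").alias("n_comments"), F.avg("score").alias("avg_score"))
    .orderBy(F.desc("n_comments"))
    .show(20, truncate=False))
```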

In “Need a Bigoted, Racist Uncle for Holiday Meal?” I mentioned the 1.6 billion Reddit comments that are the subject of this tutorial.

If you can’t find comments offensive to your guests in the Reddit comment collection, your guests are comatose and/or inanimate objects.

Big Data Holiday Game:

Divide into teams with at least one Jupyter/Apache Spark user on each team.

Play three timed rounds (the length of each round depending on your local schedule) in which each team attempts to discover the Reddit comment that is most offensive to the largest number of guests.

The winner gets bragging rights until next year, you get to show off your data mining skills, and you get a free pass on saying offensive things to your guests.

Watch for more formalized big data games of this nature by the 2016 holiday season!

Enjoy!

I first saw this in a tweet by Data Science Renee.

July 12, 2015

Reddit Archive! 1 TB of Comments

Filed under: Data,Reddit — Patrick Durusau @ 2:34 pm

You can now download a dataset of 1.65 billion Reddit comments: Beware the Redditor AI by Mic Wright.

From the post:

Once our species’ greatest trove of knowledge was the Library of Alexandria.

Now we have Reddit, a roiling mass of human ingenuity/douchebaggery that has recently focused on tearing itself apart like Tommy Wiseau in legendarily awful flick ‘The Room.’

But unlike the ancient library, the fruits of Reddit’s labors, good and ill, will not be destroyed in fire.

In fact, thanks to Jason Baumgartner of PushShift.io (aided by The Internet Archive), a dataset of 1.65 billion comments, stretching from October 2007 to May 2015, is now available to download.

The data – pulled using Reddit’s API – is made up of JSON objects, including the comment, score, author, subreddit, position in the comment tree and a range of other fields.

The uncompressed dataset weighs in at over 1TB, meaning it’ll be most useful for major research projects with enough resources to really wrangle it.

Technically, the archive is incomplete, but not significantly. After 14 months of work and many API calls, Baumgartner was faced with approximately 350,000 comments that were not available. In most cases that’s because the comment resides in a private subreddit or was simply removed.
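If you do grab one of the monthly dump files, you can stream through it without decompressing it to disk first. A minimal sketch, assuming the archive’s bzip2-compressed, newline-delimited JSON layout and a file name like RC_2015-05.bz2 (the exact naming is an assumption here):

```python
# Sketch: stream one month of the comment archive (bzip2-compressed,
# newline-delimited JSON) and tally comments per subreddit.
# The file name is an assumption; each line is one comment object with
# fields such as body, score, author, and subreddit.
import bz2
import json
from collections import Counter

subreddit_counts = Counter()

with bz2.open("RC_2015-05.bz2", mode="rt", encoding="utf-8") as f:
    for line in f:
        comment = json.loads(line)
        subreddit_counts[comment["subreddit"]] += 1

for subreddit, n in subreddit_counts.most_common(10):
    print(subreddit, n)
```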

If you don’t have a spare TB of space at the moment, you will also be interested in: http://www.reddit.com/r/bigquery/comments/3cej2b/17_billion_reddit_comments_loaded_on_bigquery/, where you will find several BigQuery queries already written against this data.

The full data set certainly makes an interesting alternative to the Turing test for AI. Can your AI generate, without assistance or access to this data set, the responses that appear in it? Is that a fair test of “intelligence”?

If you want updated data, consult the Reddit API.
