Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 2, 2014

Testing LDA

Filed under: Latent Dirichlet Allocation (LDA),Text Mining,Tweets — Patrick Durusau @ 2:12 pm

Using Latent Dirichlet Allocation to Categorize My Twitter Feed by Joseph Misiti.

From the post:

Over the past 3 years, I have tweeted about 4100 times, mostly URLs, and mostly about machine learning, statistics, big data, etc. I spent some time this past weekend seeing if I could categorize the tweets using Latent Dirichlet Allocation. For a great introduction to Latent Dirichlet Allocation (LDA), you can read the following link here. For the more mathematically inclined, you can read through this excellent paper which explains LDA in a lot more detail.

The first step to categorizing my tweets was pulling the data. I initially downloaded and installed Twython and tried to pull all of my tweets using the Twitter API, but then quickly realized there was an archive button under settings. So I stopped writing code and just double clicked the archive button. Apparently 4100 tweets is fairly easy to archive, because I received an email from Twitter within 15 seconds with a download link.

When you read Joseph’s post, note that he doesn’t use the content of his tweets but rather the content of the URLs he tweeted as the subject of the LDA analysis.

Still a valid corpus for LDA analysis, but I would not characterize it as “categorizing” his tweet feed, meaning the tweets themselves, but rather as “categorizing” the content he tweeted about. Not the same thing.

A useful exercise because it uses LDA on a corpus with which you should be familiar, the materials you tweeted about.

As opposed to using LDA on a corpus that is less well known to you, where you are reduced to running sanity checks with no real feel for the data.

It would be an interesting exercise to discover the top topics for the corpus you tweeted about (Joseph’s post) and also for the corpus of #tags that you used in your tweets. Are they the same or different?
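If you want to try the same exercise on your own archive, here is a minimal sketch of the LDA step using gensim (my choice of library, not necessarily Joseph’s toolchain). The token lists are placeholders for whatever text you extract from your tweets or from the pages they link to:

    from gensim import corpora, models

    # Placeholder documents: in practice, tokenized text from your tweets
    # or from the pages your tweeted URLs point to.
    documents = [
        ["machine", "learning", "model", "training", "data"],
        ["statistics", "regression", "inference", "data"],
        ["spark", "cluster", "streaming", "hadoop", "data"],
    ]

    dictionary = corpora.Dictionary(documents)               # word <-> id mapping
    corpus = [dictionary.doc2bow(doc) for doc in documents]  # bag-of-words vectors

    # num_topics and passes are knobs to experiment with.
    lda = models.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=10)

    for topic in lda.print_topics(num_topics=3, num_words=5):
        print(topic)

Run it once over the URL contents and once over your #tags and you have the two topic sets to compare.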

I first saw this in a tweet by Christophe Lalanne.

Palladio

Filed under: Humanities,Visualization — Patrick Durusau @ 1:53 pm

Palladio – Humanities thinking about data visualization

From the webpage:

Palladio is a web-based platform for the visualization of complex, multi-dimensional data. It is a product of the "Networks in History" project that has its roots in another humanities research project based at Stanford: Mapping the Republic of Letters (MRofL). MRofL produced a number of unique visualizations tied to individual case studies and specific research questions. You can see the tools on this site and read about the case studies at republicofletters.stanford.edu.

With "Networks in History" we are taking the insights gained and lessons learned from MRofL and applying them to a set of visualizations that reflect humanistic thinking about data. Palladio is our first step toward opening data visualization to any researcher by making it possible to upload data and visualize within the browser without any barriers. There is no need to create an account and we do not store the data. On the visualization side, we have emphasized tools for filtering. There is a timeline filter that allows for filtering on discontinuous time periods. There is a facet filter based on Moritz Stefaner's Elastic Lists that is particularly useful when exploring multidimensional data sets.

The correspondence networks in the Mapping the Republic of Letters (MRofL) project will be of particular interest to humanists.

Quite challenging on their own, but imagine the utility of exploding every letter into distinct subjects and statements about subjects, which automatically map to identified subjects and statements about subjects in other correspondence.

Scholars already know about many such relationships in intellectual history, but those associations are captured in journals and monographs, identified in various ways, and in many cases lack explicit labeling of roles. To say nothing of having to re-tread an author’s path to discover their recording of such associations in full text form.

If such paths were easy to follow, the next generation of scholars would develop new paths, as opposed to making known ones well-worn.

EuroClojure 2014 (notes)

Filed under: Clojure,Conferences — Patrick Durusau @ 10:59 am

EuroClojure 2014 (notes) by Philip Potter.

A truly amazing set of notes, with links, for EuroClojure 2014.

It’s not like being there but you will come away with new ideas, links to follow and the intention to attend EuroClojure 2015!

Enjoy!

Frege in Space:…

Filed under: Linguistics,Semantics — Patrick Durusau @ 10:09 am

Frege in Space: A Program of Compositional Distributional Semantics by Marco Baroni, Raffaella Bernardi, Roberto Zamparelli.

Abstract:

The lexicon of any natural language encodes a huge number of distinct word meanings. Just to understand this article, you will need to know what thousands of words mean. The space of possible sentential meanings is infinite: In this article alone, you will encounter many sentences that express ideas you have never heard before, we hope. Statistical semantics has addressed the issue of the vastness of word meaning by proposing methods to harvest meaning automatically from large collections of text (corpora). Formal semantics in the Fregean tradition has developed methods to account for the infinity of sentential meaning based on the crucial insight of compositionality, the idea that meaning of sentences is built incrementally by combining the meanings of their constituents. This article sketches a new approach to semantics that brings together ideas from statistical and formal semantics to account, in parallel, for the richness of lexical meaning and the combinatorial power of sentential semantics. We adopt, in particular, the idea that word meaning can be approximated by the patterns of co-occurrence of words in corpora from statistical semantics, and the idea that compositionality can be captured in terms of a syntax-driven calculus of function application from formal semantics.

At one hundred and ten (110) pages, this is going to take a while to read and even longer to digest. What I have read so far is both informative and, surprisingly for the subject area, quite pleasant reading.

Thoughts about building up a subject identification by composition?
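To make the compositional idea concrete, here is a toy sketch (mine, not the authors’) of their “syntax-driven calculus of function application”: nouns get corpus-derived vectors, while an adjective acts as a matrix, a function from noun meanings to noun-phrase meanings. The random values stand in for quantities that would be estimated from a corpus:

    import numpy as np

    rng = np.random.default_rng(0)

    # Distributional vectors for nouns, normally built from co-occurrence counts.
    house = rng.random(4)
    moon = rng.random(4)

    # "red" as a function from noun meanings to noun-phrase meanings: a matrix
    # that would be learned from corpus examples of "red X" phrases.
    red = rng.random((4, 4))

    red_house = red @ house  # composition = function application
    red_moon = red @ moon

    def cos(a, b):
        """Cosine similarity, the standard comparison in distributional semantics."""
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cos(red_house, red_moon))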

Enjoy!

I first saw this in a tweet by Stefano Bertolo.

July 1, 2014

Flambo

Filed under: Clojure,Spark — Patrick Durusau @ 7:19 pm

Flambo

From the webpage:

Flambo is a Clojure DSL for Spark. It allows you to create and manipulate Spark data structures using idiomatic Clojure.

The README is the recommended place to get started.

Cool!

Introduction to Python for Econometrics, Statistics and Data Analysis

Filed under: Data Analysis,Python,Statistics — Patrick Durusau @ 7:04 pm

Introduction to Python for Econometrics, Statistics and Data Analysis by Kevin Sheppard.

From the introduction:

These notes are designed for someone new to statistical computing wishing to develop a set of skills necessary to perform original research using Python. They should also be useful for students, researchers or practitioners who require a versatile platform for econometrics, statistics or general numerical analysis (e.g. numeric solutions to economic models or model simulation).

Python is a popular general purpose programming language which is well suited to a wide range of problems. Recent developments have extended Python’s range of applicability to econometrics, statistics and general numerical analysis. Python – with the right set of add-ons – is comparable to domain-specific languages such as MATLAB and R. If you are wondering whether you should bother with Python (or another language), a very incomplete list of considerations includes:

One of the more even-handed introductions I have read in a long time.

Enough examples and exercises to build some keyboard memory into your fingers! 😉

Bookmark this text so you can forward the link to others.
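To give you the flavor of the exercises, here is a minimal sketch (mine, not Sheppard’s) of an econometrics staple in Python, assuming NumPy and statsmodels are installed: simulate data from a linear model, then recover the coefficients by ordinary least squares:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(42)

    n = 200
    x = rng.normal(size=n)
    y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=n)  # true model: y = 1.5 + 2x + noise

    X = sm.add_constant(x)       # add the intercept column
    result = sm.OLS(y, X).fit()  # ordinary least squares

    print(result.params)         # estimates should land close to [1.5, 2.0]
    print(result.summary())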

I first saw this in a tweet by yhat.

Slaying Data Silos?

Filed under: Data Silos,Data Virtualization,Integration — Patrick Durusau @ 6:50 pm

Krishnan Subramanian’s Modern Enterprise: Slaying the Silos with Data Virtualization keeps coming up in my Twitter feed.

In speaking of breaking down data silos, Krishnan says:

A much better approach to solving this problem is abstraction through data virtualization. It is a powerful tool, well suited for the loose coupling approach prescribed by the Modern Enterprise Model. Data virtualization helps applications retrieve and manipulate data without needing to know technical details about each data store. When implemented, organizational data can be easily accessed using a simple REST API.

Data Virtualization (or an abstracted Database as a Service) plugs into the Modern Enterprise Platform as a higher-order layer, offering the following advantages:

  • Better business decisions due to organization wide accessibility of all data
  • Higher organizational agility
  • Loosely coupled services making future proofing easier
  • Lower cost

I find that troubling because there is no mention of data integration.

In fact, in more balanced coverage of data virtualization, which recites the same advantages as Krishnan, we read:

For some reason there are those who sell virtualization software and cloud computing enablement platforms who imply that data integration is something that comes along for the ride. However, nothing gets less complex and data integration still needs to occur between the virtualized data stores as if they existed on their own machines. They are still storing data in different physical data structures, and the data must be moved or copied, and the difference with the physical data structures dealt with, as well as data quality, data integrity, data validation, data cleaning, etc. (The Pros and Cons of Data Virtualization)

Krishnan begins his post:

There’s a belief that cloud computing breaks down silos inside enterprises. Yes, the use of cloud and DevOps breaks down organizational silos between different teams but it only solves part of the problem. The bigger problem is silos between data sources. Data silos, as I would like to refer to the problem, are the biggest bottleneck enterprises face as they try to modernize their IT infrastructure. As I advocate the Modern Enterprise Model, many people ask me what problems they’ll face if they embrace it. Today I’ll do a quick post to address this question at a more conceptual level, without getting into the details.

If data silos are the biggest bottleneck enterprises face, why is the means to address them, data integration, a detail?

Every hand waving approach to data integration fuels unrealistic expectations, even among people who should know better.

There are no free lunches and there are no free avenues for data integration.
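A toy illustration of the point (mine, not from either post): even when a virtualization layer hands every application the same style of record, the records themselves still disagree on names, formats and units, and someone has to write the mapping:

    # Two records for the same customer, as returned by two "virtualized"
    # stores behind the same REST facade. Field names and units differ.
    crm_record = {"cust_name": "ACME Corp", "revenue_usd": 1_200_000}
    erp_record = {"customerName": "Acme Corporation", "revenueKUSD": 1200}

    def integrate(crm, erp):
        """Hand-written reconciliation: the 'detail' the virtualization pitch skips."""
        return {
            "name": crm["cust_name"],                      # choose a canonical name
            "revenue_usd": crm["revenue_usd"],
            "revenue_usd_erp": erp["revenueKUSD"] * 1000,  # normalize the units
        }

    print(integrate(crm_record, erp_record))

The REST facade made both records reachable; it did nothing about deciding that they describe the same subject.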

(Functional) Reactive Programming (FRP) [tutorial]

The introduction to Reactive Programming you’ve been missing by Andre Staltz.

From the post:

So you’re curious in learning this new thing called (Functional) Reactive Programming (FRP).

Learning it is hard, even harder by the lack of good material. When I started, I tried looking for tutorials. I found only a handful of practical guides, but they just scratched the surface and never tackled the challenge of building the whole architecture around it. Library documentations often don’t help when you’re trying to understand some function. I mean, honestly, look at this:

Rx.Observable.prototype.flatMapLatest(selector, [thisArg])

Projects each element of an observable sequence into a new sequence of observable sequences by incorporating the element’s index and then transforms an observable sequence of observable sequences into an observable sequence producing values only from the most recent observable sequence.

Holy cow.

I’ve read two books, one just painted the big picture, while the other dived into how to use the FRP library. I ended up learning Reactive Programming the hard way: figuring it out while building with it. At my work in Futurice I got to use it in a real project, and had the support of some colleagues when I ran into troubles.

The hardest part of the learning journey is thinking in FRP. It’s a lot about letting go of old imperative and stateful habits of typical programming, and forcing your brain to work in a different paradigm. I haven’t found any guide on the internet in this aspect, and I think the world deserves a practical tutorial on how to think in FRP, so that you can get started. Library documentation can light your way after that. I hope this helps you.

Andre is moving in the right direction when he announces:

FRP is programming with asynchronous data streams.

Data streams. I have been hearing that a lot lately. 😉

The view that data is static, file based, etc., was an artifact of our storage and processing technology. Not that data “streams” are a truer view of data, but they are a different one.

The semantic/subject identity issues associated with data don’t change whether you have a static or stream view of data.

Although, with data streams, the processing requirements for subject identity become different. For example, with static data a change (read merger) can propagate throughout a topic map.

With data streams, there may be no retrospective application of a new merging rule, it may only impact data streams going forward. Your view of the topic map becomes a time-based snapshot of the current state of a merged data stream.
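Here is a minimal sketch of that forward-only behavior, in plain Python rather than any Rx library: a rule attached to a stream sees only the values emitted after it is attached:

    class Stream:
        """A toy push-based stream: subscribers receive each emitted value."""

        def __init__(self):
            self.subscribers = []

        def subscribe(self, fn):
            self.subscribers.append(fn)

        def emit(self, value):
            for fn in self.subscribers:
                fn(value)

    stream = Stream()
    stream.subscribe(lambda v: print("raw:", v))

    stream.emit("item-1")  # only the raw subscriber sees this

    # A new "merging rule" arrives mid-stream; it applies going forward only.
    stream.subscribe(lambda v: print("merged view:", v.upper()))

    stream.emit("item-2")  # both subscribers see this one; item-1 is never revisited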

If you are looking for ways to explore such issues, FRP and this tutorial are a good place to start.

The Proceedings of the Old Bailey, 1674-1913

Filed under: History,Language — Patrick Durusau @ 4:00 pm

The Proceedings of the Old Bailey, 1674-1913

From the webpage:

A fully searchable edition of the largest body of texts detailing the lives of non-elite people ever published, containing 197,745 criminal trials held at London’s central criminal court. If you are new to this site, you may find the Getting Started and Guide to Searching videos and tutorials helpful.

While writing about using The WORD on the STREET for examples of language change, I remembered that the proceedings of the Old Bailey were online.

An extremely rich site with lots of help for the average reader, but there is one section in particular I want to point out:

Gender in the Proceedings

Men’s and women’s experiences of crime, justice and punishment

Virtually every aspect of English life between 1674 and 1913 was influenced by gender, and this includes behaviour documented in the Old Bailey Proceedings. Long-held views about the particular strengths, weaknesses, and appropriate responsibilities of each sex shaped everyday lives, patterns of crime, and responses to crime. This page provides an introduction to gender roles in this period; a discussion of how they affected crime, justice, and punishment; and advice on how to analyse the Proceedings for information about gender.

Gender relations are but one example of the semantic distance that exists between us and our ancestors. We cannot ever eliminate that distance, any more than we can talk about the moon without remembering we have walked upon it.

But, we can do our best to honor that semantic distance by being aware that their world is not ours. Closely attending to language is a first step in that direction.

Enjoy!

The WORD on the STREET

Filed under: Data,History,News — Patrick Durusau @ 3:31 pm

The WORD on the STREET

From the webpage:

In the centuries before there were newspapers and 24-hour news channels, the general public had to rely on street literature to find out what was going on. The most popular form of this for nearly 300 years was ‘broadsides’ – the tabloids of their day. Sometimes pinned up on walls in houses and ale-houses, these single sheets carried public notices, news, speeches and songs that could be read (or sung) aloud.

The National Library of Scotland’s online collection of nearly 1,800 broadsides lets you see for yourself what ‘the word on the street’ was in Scotland between 1650 and 1910. Crime, politics, romance, emigration, humour, tragedy, royalty and superstitions – all these and more are here.

Each broadside comes with a detailed commentary and most also have a full transcription of the text, plus a downloadable PDF facsimile. You can search by keyword, browse by title or browse by subject.

Take a look, and discover what fascinated our ancestors!

An excellent resource for examples of the changing meanings of words over time.

For example, what do you think “sporting” means?

Ready? Try A List of Sporting Ladies…who took their Pleasure at Kelso Races to see if your answer matches that given by the collectors.

BTW, the browsing index will remind you of modern newscasts, covering accidents, crime, executions, politics, transvestites, war and other staples of the news industry.

Credulity Question for Interviewees

Filed under: NSA,Skepticism — Patrick Durusau @ 2:48 pm

Max Fisher authored Map: The 193 foreign countries the NSA spies on and the 4 it doesn’t, which has the following map:

[Image: map of NSA spying authorizations by country]

Max covers the history of the authority of the NSA to spy on governments, organizations, etc., so see his post for the details.

A credulity question for interviewees:

What countries are being spied upon by the NSA without permission? Color in those countries with a #2 pencil.

If they make no changes to the map, you can close the interview early. (The correct answer is six, including the United States.)

Anyone who leaves the map unchanged is clearly a candidate for phishing attacks, violations of security protocols, and passphrase/password sharing. Frankly, it is surprising they made it to the interview.

Domeo and Utopia for PDF…

Filed under: Annotation,PDF — Patrick Durusau @ 2:19 pm

Domeo and Utopia for PDF, Achieving annotation interoperability by Paolo Ciccarese.

From the description:

The Annotopia (github.com/Annotopia/) Open Annotation universal Hub makes it possible to achieve annotation interoperability between different annotation clients. This is a first small demo where the annotations created with the Domeo Web Annotation Tool (annotationframework.org/) can be seen by the users of the Utopia for PDF application (utopiadocs.com/).

The demonstration shows text being highlighted and a note attached to an HTML page in a web browser; the same document is then loaded as a PDF, where the highlighting and note appear exactly as specified for the HTML page.
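For the curious, the interoperability rests on both clients exchanging annotations shaped by a shared model. Here is a sketch of roughly what such an annotation looks like, using the terms of the later W3C Web Annotation model (the 2014-era Open Annotation names differ slightly), with invented values:

    # A hedged sketch of an annotation as a Python dict, following the W3C
    # Web Annotation JSON-LD shape; the source URL and text values are invented.
    annotation = {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "body": {
            "type": "TextualBody",
            "value": "This passage needs a citation.",
            "format": "text/plain",
        },
        "target": {
            "source": "http://example.org/article",  # an HTML or PDF rendition
            "selector": {
                "type": "TextQuoteSelector",
                "exact": "the quoted passage",
                "prefix": "text just before ",
                "suffix": " text just after",
            },
        },
    }

    # A TextQuoteSelector anchors by quoting the text itself rather than by
    # byte offsets, which is why a PDF client can re-anchor a note that was
    # created against the HTML rendition.
    print(annotation["target"]["selector"]["exact"])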

The Domeo Web Annotation Tool appears to have the capacity to be a topic map authoring tool against full text.

Definite progress on the annotation front!

The next question is: how do we find all the relevant annotations despite differences in user terminology? It is the same problem we have with searching, but in the annotations instead of the main text.

You could start from some location in the text, but I’m not sure all users will annotate the same material. Some may comment on the article in general; others will annotate very specific text.

Definitely a topic map issue both in terms of subjects in the text as well as in the annotations.

Piketty in R markdown

Filed under: Ecoinformatics,Open Data,R — Patrick Durusau @ 11:56 am

Piketty in R markdown – we need some help from the crowd by Jeff Leek.

From the post:

Thomas Piketty’s book Capital in the 21st Century was a surprise best seller and the subject of intense scrutiny. A few weeks ago the Financial Times claimed that the analysis was riddled with errors, leading to a firestorm of discussion. A few days ago the London School of Economics posted a similar call to make the data open and machine readable, saying:

None of this data is explicitly open for everyone to reuse, clearly licenced and in machine-readable formats.

A few friends of Simply Stats had started on a project to translate his work from the excel files where the original analysis resides into R. The people that helped were Alyssa Frazee, Aaron Fisher, Bruce Swihart, Abhinav Nellore, Hector Corrada Bravo, John Muschelli, and me. We haven’t finished translating all chapters, so we are asking anyone who is interested to help contribute to translating the book’s technical appendices into R markdown documents. If you are interested, please send pull requests to the gh-pages branch of this Github repo.

Hmmm, debate to be conducted based on known data sets?

That sounds like a radical departure from most public debates, to say nothing of debates in politics.

Dangerous because the general public may come to expect news reports, government budgets, documents, etc. to be accompanied by machine readable data files.

Even more dangerous if data files are compared to other data files, for consistency, etc.

No time like the present to start. Think about helping with the Piketty materials.

You may be helping to start a trend.

Visualizing Philosophers And Scientists

Filed under: D3,NLTK,Scikit-Learn,Visualization,Word Cloud — Patrick Durusau @ 10:31 am

Visualizing Philosophers And Scientists By The Words They Used With Python and d3.js by Sahand Saba.

From the post:

This is a rather short post on a little fun project I did a couple of weekends ago. The purpose was mostly to demonstrate how easy it is to process and visualize large amounts of data using Python and d3.js.

With the goal of visualizing the words that were most associated with a given scientist or philosopher, I downloaded a variety of science and philosophy books that are in the public domain (project Gutenberg, more specifically), and processed them using Python (scikit-learn and nltk), then used d3.js and d3.js cloud by Jason Davies (https://github.com/jasondavies/d3-cloud) to visualize the words most frequently used by the authors. To make it more interesting, only words that are somewhat unique to the author are displayed (i.e. if a word is used frequently by all authors then it is likely not that interesting and is dropped from the results). This can be easily achieved using the max_df parameter of the CountVectorizer class.
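The max_df trick is worth seeing in miniature. A sketch (mine, not Sahand’s code) with three stand-in “books”: the word shared by all three authors is dropped automatically:

    from sklearn.feature_extraction.text import CountVectorizer

    author_texts = [
        "impressions ideas custom habit causation experience",  # stand-in Hume
        "synthetic categories intuition judgment experience",   # stand-in Kant
        "species selection variation descent experience",       # stand-in Darwin
    ]

    # max_df=0.5: ignore terms occurring in more than 50% of the documents,
    # so "experience", common to all three, is filtered out automatically.
    vectorizer = CountVectorizer(max_df=0.5)
    counts = vectorizer.fit_transform(author_texts)

    print(vectorizer.get_feature_names_out())  # scikit-learn >= 1.0
    assert "experience" not in vectorizer.get_feature_names_out()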

I pass by Copleston’s A History of Philosophy several times a day. It is a paperback edition from many years ago that I keep meaning to re-read.

At least for philosophers with enough surviving texts in machine readable format, perhaps Sahand’s post will provide the incentive to return to reading Copleston. A word cloud is one way to explore a text. Commentary, such as Copleston’s, is another.

What other tools would you use with philosophers and a commentary like Copleston?

I first saw this in a tweet by Christophe Viau.
