Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 28, 2011

An Introduction to Data Mining

Filed under: Data Mining — Patrick Durusau @ 6:58 pm

An Introduction to Data Mining by Dr. Saed Sayad

A very interesting map of data mining, the nodes of which lead to short articles on particular topics.

It is a useful resource for reviewing material on data mining, either as part of a course or for self-study.

While not part of the map, don’t miss the Further Readings link in the bottom left-hand corner.

March 26, 2011

Topic Modeling Browser (LDA)

Topic Modeling Browser (LDA)

From a post by David Blei:

allison chaney has created the “topic model visualization engine,” which can be used to create browsers of document collections based on a topic model. i think this will become a very useful tool for us. the code is on google code:
http://code.google.com/p/tmve/
as an example, here is a browser built from a 50-topic model fit to 100K articles from wikipedia:
http://www.sccs.swarthmore.edu/users/08/ajb/tmve/wiki100k/browse/topic-list.html
allison describes how she built the browser in the README for her code:
http://code.google.com/p/tmve/wiki/TMVE01
finally, to check out the code and build your own browser, see here:
http://code.google.com/p/tmve/source/checkout

Take a look.

As I have mentioned before, LDA could be a good exploration tool for document collections, preparatory to building a topic map.

March 20, 2011

Overview of Text Extraction Algorithms

Filed under: Data Mining,Text Extraction — Patrick Durusau @ 1:24 pm

Overview of Text Extraction Algorithms

Short review and pointers to posts by computer science student Tomaž Kovačič listing resources for text extraction.

If you are building topic maps based on text extraction, from web pages in particular, it is well worth the time to take a look.

March 18, 2011

Learning Data Science Skills

Filed under: Data Mining — Patrick Durusau @ 6:51 pm

Learning Data Science Skills

Christopher Bare has a useful collection of links to resources for wannabe data scientists.

I would be interested to know which tools, tutorials, etc. you have found most helpful.

March 15, 2011

2010 Data Miner Survey

Filed under: Data Mining,News — Patrick Durusau @ 5:29 am

2010 Data Miner Survey

R comes out as the #1 tool which is how I heard about the survey.

I have requested a copy, mostly so I can see what other tools are reported as used by data miners.

Topic maps start with the discovery of data that becomes part of or subject to a topic map and end with the delivery of data to a user.

They are data, end to end.

March 10, 2011

evo*2011

Filed under: Data Mining,Evoluntionary,Machine Learning — Patrick Durusau @ 12:32 pm

evo*2011

From the website:

evo* comprises the premier co-located conferences in the field of Evolutionary Computing: eurogp, evocop, evobio and evoapplications.

Featuring the latest in theoretical and applied research, evo* topics include recent genetic programming challenges, evolutionary and other meta-heuristic approaches for combinatorial optimization, evolutionary algorithms, machine learning and data mining techniques in the biosciences, in numerical optimization, in music and art domains, in image analysis and signal processing, in hardware optimization and in a wide range of applications to scientific, industrial, financial and other real-world problems.

Conference is 27-29 April 2011 in Torino, Italy.

Even if you are not in the neighborhood, the paper abstracts make an interesting read!

Top Ten Algorithms in Data Mining

Filed under: Algorithms,Data Mining — Patrick Durusau @ 9:13 am

Top Ten Algorithms in Data Mining

Summary of a paper on data mining algorithms that were nominated and voted on by ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners to produce a top 10 list.

I was curious about how the entries on the list from 2007 have fared.

I searched CiteseerX limiting the publication year to 2010.

The results, algorithm followed by citation count, were as follows:

  1. C4.5 – 41
  2. The k-Means algorithm – 86
  3. Support Vector Machines – 64
  4. The Apriori algorithm – 46
  5. Expectation-Maximization – 41
  6. PageRank – 19
  7. AdaBoost – 11
  8. k-Nearest Neighbor Classification – 36*
  9. Naive Bayes – 25
  10. CART (Classification and Regression Trees) – 11

*Searched as “k-Nearest Neighbor”.
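For anyone who wants to play with these numbers, here is a minimal sketch that ranks the counts above and shows each algorithm's share of the total. The counts are copied from my list; the function name and the shortened labels are just illustrative choices, and nothing here queries CiteSeerX.

```python
# Citation counts copied from the list above (CiteSeerX, publication year 2010).
# Labels are shortened for display; "k-Nearest Neighbor" is the search term used.
citations = {
    "C4.5": 41,
    "k-Means": 86,
    "Support Vector Machines": 64,
    "Apriori": 46,
    "Expectation-Maximization": 41,
    "PageRank": 19,
    "AdaBoost": 11,
    "k-Nearest Neighbor": 36,
    "Naive Bayes": 25,
    "CART": 11,
}


def rank_by_citations(counts):
    """Return (algorithm, count) pairs, most cited first.

    sorted() is stable, so algorithms with equal counts keep their list order.
    """
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


ranked = rank_by_citations(citations)
total = sum(citations.values())

for name, count in ranked:
    # Share of all citations found across the ten searches.
    print(f"{name:26s} {count:3d}  {count / total:.1%}")
```

A broader survey would swap the hard-coded dictionary for real search results, but the ranking step would stay the same.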

Not a scientific study but enough variation to make me curious about:

  1. Broader survey of algorithm citation.
  2. What articles cite more than one algorithm?
  3. Are there any groupings by subject of study?

Not a high priority item but something I want to return to and examine more closely.

March 8, 2011

Summify’s Technology Examined

Filed under: Data Analysis,Data Mining,MongoDB,MySQL,Redis — Patrick Durusau @ 9:54 am

Summify’s Technology Examined

Phil Whelan writes an interesting review of the underlying technology for Summify.

Many of those same components are relevant to the construction of topic map based services.

Interesting that Summify uses MySQL, Redis and MongoDB.

I rather like the idea of using the best tool for a particular job.

Worth a close read.

March 7, 2011

logstash

Filed under: Data Mining — Patrick Durusau @ 7:06 am

logstash

From the website:

logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use (like, for searching). Speaking of searching, logstash comes with a web interface for searching and drilling into all of your logs.

I mention this for two reasons:

First, obviously as a tool for mining/searching logs. Deciding what subjects in a log will later appear in a topic map starts with discovery of those subjects.

Secondly, perhaps less obviously, I am thinking that adding subject identity to events discovered in logs could enable mapping across logs, say, for example, if you were mining TCP/IP packet traffic.

Can’t imagine why anyone would be sitting on or near a big switch doing that, ;-), but just to cover all the edge cases.

If you filtered out all the known porn site and search engine traffic, both of which are large but knowable lists, the amount of stuff you have to process starts to look pretty manageable.

Does anyone know the ratio of porn/search to other traffic into the Pentagon? Or Congress? Just curious if there is a useful baseline.
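To make the subject identity point a bit more concrete, here is a rough sketch of mapping events across two logs. Everything in it is an assumption made for illustration: the two log formats, the sample lines, and the rule that a host is identified by its source IP address. It does not use logstash or any of its APIs.

```python
import re
from collections import defaultdict

# Hypothetical log lines in two different formats (not produced by logstash).
WEB_LOG = [
    "10.0.0.5 GET /index.html 200",
    "10.0.0.9 GET /search?q=maps 200",
]
FIREWALL_LOG = [
    "ALLOW src=10.0.0.5 dst=192.0.2.10 port=443",
    "DENY src=10.0.0.7 dst=192.0.2.99 port=25",
]

WEB_PATTERN = re.compile(
    r"^(?P<ip>\S+) (?P<method>\S+) (?P<path>\S+) (?P<status>\d+)$"
)
FW_PATTERN = re.compile(
    r"^(?P<action>\w+) src=(?P<ip>\S+) dst=(?P<dst>\S+) port=(?P<port>\d+)$"
)


def subject_id(ip):
    """Assumed identity rule: a host is identified by its source IP address."""
    return f"host:{ip}"


def parse(lines, pattern, source):
    """Yield (subject identifier, event) pairs for lines that match the pattern."""
    for line in lines:
        match = pattern.match(line)
        if match:
            event = match.groupdict()
            event["source"] = source
            yield subject_id(event.pop("ip")), event


def merge_by_subject(*streams):
    """Collect events from every log under the subject they are about."""
    merged = defaultdict(list)
    for stream in streams:
        for sid, event in stream:
            merged[sid].append(event)
    return dict(merged)


merged = merge_by_subject(
    parse(WEB_LOG, WEB_PATTERN, "web"),
    parse(FIREWALL_LOG, FW_PATTERN, "firewall"),
)

for sid, events in merged.items():
    print(sid, [event["source"] for event in events])
```

The interesting design question is the identity rule itself. An IP address is a weak identifier, and a real mapping would probably need several rules, but once events share a subject identifier the merge across logs is trivial.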

March 4, 2011

Metaoptimize Q+A

Metaoptimize Q+A is one of the Q/A sites I just stumbled across.

From the website:

A community of scientists interested in machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization, as well as adjacent topics.

Looks like an interesting place to hang out.

Third Cross Validated Journal Club

Filed under: Data Mining,Statistics — Patrick Durusau @ 6:08 am

Third Cross Validated Journal Club

From the posting:

  • CVJC is a whole day meeting on chat where we discuss some paper and its theoretical/practical surroundings.
  • As mentioned above the event is whole-day (00:00-23:59UTC), but there are three meet-up sessions at 1:00, 9:00 and 16:00UTC on which most talking take place; they are spread over day to put at least one CVJC session in reach regardless of time zone.
  • The paper must be OpenAccess or a (p)reprint suggested previously on a meta thread like this one and selected in voting.
  • I would try to invite the author (it worked last time).

See the posting for the proposal for the next Cross Validated meeting date and discussion material.

Thinking something like this could be of interest in the topic maps community.

Cross Validated

Filed under: Data Mining,Statistics,Visualization — Patrick Durusau @ 5:58 am

Cross Validated

From the website:

This is a collaboratively edited question and answer site for statisticians, data analysts, data miners and data visualization experts. It’s 100% free, no registration required.

This is one of a series of such Q/A sites that I am going to be listing as of possible interest to the topic maps community.

March 2, 2011

RKWard

Filed under: Data Mining,R — Patrick Durusau @ 10:22 am

RKWard

Another R IDE for data mining. Thought I should mention it since I also posted a note about RStudio.

From the website:

RKWard is meant to become an easy to use, transparent frontend to the R-language, a very powerful, yet hard-to-get-into scripting-language with a strong focus on statistic functions. It will not only provide a convenient user-interface, however, but also take care of seamless integration with an office-suite. Practical statistics is not just about calculating, after all, but also about documenting and ultimately publishing the results.

RKWard then is (will be) something like a free replacement for commercial statistical packages. In addition to ease of use, three aspects are particularily important:

  • It will be a transparent interface to the underlying R-language. That is, it will not hide the powerful syntax, but merely provide a convenient way, in which both newbies and R-experts can accomplish most of their tasks. A GUI can never provide an interface to the whole power of a language like R. In some cases users will want to tweak some functions to their particular needs and esp. to automate some tasks. By making the “inner workings” visible to the user, RKWard will make it easy for the user to see where and how to use R-syntax to accomplish their goals.
  • For the output, RKWard strives to separate content and design to a high degree. It will not try to design its own tables/graphs, etc, which have to be converted to the style used in the rest of a publication by hand. Currently RKWard uses HTML for its output. Using appropriate style definitions reformatting this output to match the rest of the publication will be easily doable. In future releases RKWard will even seek stronger integration with existing office suites.
  • It relies on a language, that is not only very powerful, but also extensible, and for which dozens of extensions already exist.

And of course, it is free (as in free speech).

RStudio

Filed under: Data Mining,R — Patrick Durusau @ 7:08 am

RStudio

From the website:

RStudio™ is a new integrated development environment (IDE) for R. RStudio combines an intuitive user interface with powerful coding tools to help you get the most out of R.

Productive

RStudio brings together everything you need to be productive with R in a single, customizable environment. Its intuitive interface and powerful coding tools help you get work done faster.

Runs Everywhere

RStudio is available for all major platforms including Windows, Mac OS X, and Linux. It can even run alongside R on a server, enabling multiple users to access the RStudio IDE using a web browser.

Free & Open

Like R, RStudio is available under an open source license that guarantees the freedom to share and change the software, and to make sure it remains free software for all its users.

I first saw this mentioned at:

RStudio: An Open Source and Cross-Platform IDE for R

OSCON Data 2011 Call for Participation

Filed under: Conferences,Data Analysis,Data Mining,Data Models,Data Structures — Patrick Durusau @ 7:07 am

OSCON Data 2011 Call for Participation

Deadline: 11:59pm 03/14/2011 PDT

From the website:

The O’Reilly OSCON Data conference is the first of its kind: bringing together open source culture and data hackers to cover data management at a very practical level. From disks and databases through to big data and analytics, OSCON Data will have instruction and inspiration from the people who actually do the work.

OSCON Data will take place July 25-27, 2011, in Portland, Oregon. We’ll be co-located with OSCON itself.

Proposals should include as much detail about the topic and format for the presentation as possible. Vague and overly broad proposals don’t showcase your skills and knowledge, and our volunteer reviewers aren’t mind readers. The more you can tell us, the more likely the proposal will be selected.

Proposals that seem like a “vendor pitch” will not be considered. The purpose of OSCON Data is to enlighten, not to sell.

Submit a proposal.

Yes, it is right before Balisage but I think worth considering if you are on the West Coast and can’t get to Balisage this year or if you are feeling really robust. 😉

Hmmm, I wonder how a proposal that merges the indexes of the different NoSQL volumes from O’Reilly would be received? You are aware that O’Reilly is re-creating the X-Windows problem that was the genesis of both topic maps and DocBook?

I will have to write that up in detail at some point. I wasn’t there but have spoken to some of the principals who were. Plus I have the notes, etc.

March 1, 2011

Social Data and Log Analysis Using MongoDB

Filed under: Data Mining,Log Analysis,MongoDB — Patrick Durusau @ 11:33 am

Social Data and Log Analysis Using MongoDB

Interesting use of MongoDB.

Work through the slide deck and consider the following questions along the way:

  1. How would your analysis of the logs (the process of analysis) be different if you were using topic maps?
  2. How would your results from #1 be different?
  3. Choose a set of logs and test your answers to #1 and #2.

(Credit will be awarded equally whether #3 confirms or contradicts your analysis in #1 and #2. The purpose of the exercise is to develop a “feel” for fruitful areas of investigation.)

InTech – Open Access Publisher

Filed under: Books,Data Mining,Self-Organizing — Patrick Durusau @ 10:18 am

I lightly scan the spam filter for the blog before I clean it out, and this time I saw:

Hello. Yesterday I found two new books about Data mining. These series of books entitled by ‘Data Mining’ address the need by presenting in-depth description of novel mining algorithms and many useful applications. In addition to understanding each section deeply, the two books present useful hints and strategies to solving problems in the following chapters.The progress of data mining technology and large public popularity establish a need for a comprehensive text on the subject. Books are: “New Fundamental Technologies in Data Mining” here http://www.intechopen.com/books/show/title/new-fundamental-technologies-in-data-mining & “Knowledge-Oriented Applications in Data Mining” here http://www.intechopen.com/books/show/title/knowledge-oriented-applications-in-data-mining These are open access books so you can download it for free or just read on online reading platform like I do. Cheers!

I was curious enough to follow the links and was glad I did.

InTech – Open Access Publisher has a number of volumes for downloading that may interest topic mappers. For free!

At first I thought these were article collections, made up of conference and other papers. I have only spot checked Self Organizing Maps – Applications and Novel Algorithm Design, edited by Josphat Igadwa Mwasiagi, but none of the paper titles appear in web searches, other than at Intechweb.org.

Apologies for appearing suspicious but there is so much re-cycled content on the WWW these days. That does not appear to be the case here, which is welcome news!

Would appreciate hearing of the experience of others with volumes from this site.

February 28, 2011

YouTube Topic Map?

Filed under: Authoring Topic Maps,Data Mining,Topic Maps — Patrick Durusau @ 10:55 am

Is anyone working on, or thinking about working on, a topic map for YouTube?

I ask because while I can eventually find search terms that will narrow the videos down to a set of lectures, they are disorderly and have duplicates.

If someone is working on a project that would include CS lectures and similar offerings, I would be willing to contribute some editing/sorting of data.

Probably not the most popular subject for a community based topic map. 😉

I might be willing to contribute some editing/sorting of data for more popular topic maps as well. Depends on the topic. (sorry!)

Suggestions (with a link to a representative YouTube video) welcome!

You can even conceal your identity! I won’t out you for liking the sneezing panda video.

R Fundamentals and Programming Techniques

Filed under: Authoring Topic Maps,Data Mining,R — Patrick Durusau @ 8:33 am

R Fundamentals and Programming Techniques

Thomas Lumley on R.

One of the strengths and weaknesses of the topic map standardization effort was that it presumed you already had a topic map.

A strength because the methods for arriving at a topic map remain unbounded and unsullied by choices (and limitations) of languages, approaches, etc.

A weakness because the topic map novice is left in the position of a tourist who marvels at a medieval cathedral but has no idea how to build one themselves. (Well, ok, perhaps that is a bit of a stretch. 😉 )

The fact remains that ever-increasing amounts of data are becoming available, much of which is just crying out for topic maps to be built for its navigation.

R is one of the currently popular data mining languages that can be pressed into service for the exploration of data and construction of topic maps.

Definitely a resource to explore and exploit before you invest in any of the printed R reference materials.

February 24, 2011

ICWSM 2011 Data Challenge

Filed under: Conferences,Data Mining,Dataset — Patrick Durusau @ 12:21 pm

ICWSM 2011 Data Challenge

From the website:

The ICWSM 2011 Data Challenge introduces a brand-new dataset, the 2011 ICWSM Spinn3r dataset. This dataset includes blogs from Spinn3r over a 33 day period, from January 13th, 2011 through February 14th, 2011. See here for details on how to obtain the collection.

Since the new collection spans some rather extraordinary world events, this year introduces a specific task: to locate significant posts in the collection which are relevant to the revolutions in Tunisia and Egypt. The criterion for “significant relevance” is that the post is worthy of being shared by you, an observer, with a friend. To participate in the task, we will ask that you submit a ranked list of items in the collection, and we will do some form of relevance judgments and scoring in time for the conference.

The data challenge will culminate at ICWSM 2011 with a special workshop. To participate in the workshop, you must submit a 3-page short paper in PDF format and bring a poster to present at the workshop. The short papers will not be reviewed, but the workshop organizers will select a small panel of speakers based on the submissions. The short paper/poster can describe your participation in the shared task, OR ALTERNATIVELY other compelling work you have performed WITH THE 2011 DATASET.

Submissions will be due on April 22, 2011. Details on the submission process will be posted soon.

Oh, just briefly about the collection:

The dataset consists of over 386 million blog posts, news articles, classifieds, forum posts and social media content between January 13th and February 14th. It spans events such as the Tunisian revolution and the Egyptian protests (see http://en.wikipedia.org/wiki/January_2011 for a more detailed list of events spanning the dataset’s time period).

If you are going to be in Barcelona (the conference location), why not submit an entry using topic maps?

February 14, 2011

Data Mining Video

Filed under: Data Mining — Patrick Durusau @ 1:44 pm

How I OCR hundreds of hours of video.

A very useful posting for anyone interested in mining the text overlays displayed during TV coverage.

Here the context is legislative coverage but I assume the same principles apply in other contexts.

One topic map aspect would be to create mappings to other materials involving the same parties or issues.

“Data Bootcamp” tutorial at O’Reilly’s Strata Conference 2011

Filed under: Data Analysis,Data Mining — Patrick Durusau @ 10:39 am

“Data Bootcamp” tutorial at O’Reilly’s Strata Conference 2011

All the materials from the “Data Bootcamp.”

I haven’t had time to review the materials but am looking forward to it.

February 11, 2011

Dealing with Data

Filed under: Data Analysis,Data Mining,Marketing — Patrick Durusau @ 12:45 pm

Dealing with Data

From the website:

In the 11 February 2011 issue, Science joins with colleagues from Science Signaling, Science Translational Medicine, and Science Careers to provide a broad look at the issues surrounding the increasingly huge influx of research data. This collection of articles highlights both the challenges posed by the data deluge and the opportunities that can be realized if we can better organize and access the data.

Science is making access to this entire collection FREE (simple registration is required for non-subscribers).

The growing concern over the influx of data represents a golden marketing opportunity for topic maps!

First, the predictions about increasing amounts of data are coming true.

That means impressive numbers to cite and even more impressive predictions about the future.

Second, the coming data deluge represents a range of commercial opportunities.

Opportunities for reuse, comparison, and mining of such data abound. And they only increase as more data comes online.

Are you going to be the Facebook of some data area?

Third, and the reason unique to topic maps:

The format that contains data is recognized as composed of subjects.

Subjects that can be identified, placed in associations, and have properties added to them.

That one insight is critical to re-use, combination and comparison of data in the data deluge.

If you identify the subjects that compose those structures, as well as the subject thought to be recognized by those data structures, you can then create maps between diverse data sets.

It is the identification of subjects that enables the creation and interchange of maps of where to swim in this vast sea of data.
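A small sketch may help show what identifying the subjects that compose a data structure buys you. The two data sets, their field names, and the subject labels below are all hypothetical. The point is only that once each field is mapped to the subject it represents, records from differently structured sources can be merged.

```python
# Two hypothetical data sets that describe the same person with different fields.
DATASET_A = [{"emp_no": "17", "surname": "Doe", "dept": "R&D"}]
DATASET_B = [{"staff_id": "17", "family_name": "Doe", "phone": "555-0100"}]

# Assumed mapping from each data set's field names to the subjects they represent.
FIELD_SUBJECTS = {
    "A": {"emp_no": "employee-id", "surname": "family-name", "dept": "department"},
    "B": {"staff_id": "employee-id", "family_name": "family-name", "phone": "telephone"},
}


def to_subjects(record, dataset):
    """Re-key a record by the subjects its fields represent."""
    mapping = FIELD_SUBJECTS[dataset]
    return {mapping[field]: value for field, value in record.items()}


def merge(records_a, records_b, key="employee-id"):
    """Merge records from both data sets that share the same identifying value."""
    merged = {}
    for dataset, records in (("A", records_a), ("B", records_b)):
        for record in records:
            normalized = to_subjects(record, dataset)
            merged.setdefault(normalized[key], {}).update(normalized)
    return merged


combined = merge(DATASET_A, DATASET_B)
print(combined)
```

Note that the mapping table is where all the real work lives. The code never needs to know that "surname" and "family_name" are spelled differently, only that someone has identified them as the same subject.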

*****
PS: I am going to take a slow walk through these articles and will be posting about opportunities that I see for topic maps. Your comments/feedback welcome!

Data talks and keynotes from O’Reilly Strata conference

Filed under: Conferences,Data Mining — Patrick Durusau @ 6:30 am

Data talks and keynotes from O’Reilly Strata conference highlighted by FlowingData.com.

Embedded at FlowingData:

  • Hilary Mason, “What Data Tells Us”
  • Mark Madsen, “The Mythology of Big Data”
  • Werner Vogels, “Data Without Limits”

Other presentations and interviews are on YouTube.

February 3, 2011

PyBrain: The Python Machine Learning Library

PyBrain: The Python Machine Learning Library

From the website:

PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms.

PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive “Backronym”.

How is PyBrain different?

While there are a few machine learning libraries out there, PyBrain aims to be a very easy-to-use modular library that can be used by entry-level students but still offers the flexibility and algorithms for state-of-the-art research. We are constantly working on more and faster algorithms, developing new environments and improving usability.

What PyBrain can do

PyBrain, as its written-out name already suggests, contains algorithms for neural networks, for reinforcement learning (and the combination of the two), for unsupervised learning, and evolution. Since most of the current problems deal with continuous state and action spaces, function approximators (like neural networks) must be used to cope with the large dimensionality. Our library is built around neural networks in the kernel and all of the training methods accept a neural network as the to-be-trained instance. This makes PyBrain a powerful tool for real-life tasks.

Another tool kit to assist in the construction of topic maps.

And another likely contender for the Topic Map Competition!

MALLET: MAchine Learning for LanguagE Toolkit
Topic Map Competition (TMC) Contender?

MALLET: MAchine Learning for LanguagE Toolkit

From the website:

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

MALLET includes sophisticated tools for document classification: efficient routines for converting text to “features”, a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. These methods are implemented in an extensible system for finite state transducers.

Topic models are useful for analyzing large collections of unlabeled text. The MALLET topic modeling toolkit contains efficient, sampling-based implementations of Latent Dirichlet Allocation, Pachinko Allocation, and Hierarchical LDA.

Many of the algorithms in MALLET depend on numerical optimization. MALLET includes an efficient implementation of Limited Memory BFGS, among many other optimization methods.

In addition to sophisticated Machine Learning applications, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of “pipes”, which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors.

An add-on package to MALLET, called GRMM, contains support for inference in general graphical models, and training of CRFs with arbitrary graphical structure.

Another tool to assist in the authoring of a topic map from a large data set.
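The “pipes” idea in the quoted description (tokenizing strings, removing stopwords, converting sequences into count vectors) is easy to illustrate. What follows is a plain Python sketch of the general technique, not MALLET's Java API. The function names and the tiny stopword list are my own assumptions.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


def to_counts(tokens):
    """Convert a token sequence into a sparse count vector."""
    return Counter(tokens)


def pipeline(*steps):
    """Chain steps so the output of each becomes the input of the next."""
    def run(data):
        for step in steps:
            data = step(data)
        return data
    return run


text_to_counts = pipeline(tokenize, remove_stopwords, to_counts)
counts = text_to_counts("The map of the topics is a map.")
print(counts)
```

Because each step is an independent function, steps can be swapped or reordered without touching the others, which is the appeal of the design.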

It would be interesting, but beyond the scope of the topic maps class, to organize a competition around several of the natural language processing packages.

To have a common data set, to be released on X date, with topic maps due say within 24 hours (there is a TV show with that in the title or so I am told).

Will have to give that some thought.

Could be both interesting and entertaining.

January 26, 2011

GSoC 2010 mid-term: Graph Streaming API – Post

Filed under: Data Mining,Gephi,Graphs,Visualization — Patrick Durusau @ 6:08 am

GSoC 2010 mid-term: Graph Streaming API by André Panisson.

From the blog:

The purpose of the Graph Streaming API project, run by André Panisson, is to build a unified framework for streaming graph objects. Gephi’s data structure and visualization engine has been built with the idea that a graph is not static and might change continuously. By connecting Gephi with external data-sources, we leverage its power to visualize and monitor complex systems or enterprise data in real-time. Moreover, the idea of streaming graph data goes beyond Gephi, and a unified and standardized API could bring interoperability with other available tools for graph and network analysis, as they could start to interoperate with other tools in a distributed and cooperative fashion.

There are times when no comment seems adequate. This is one of those times.

Read the post, play with the code, follow the work (and support it!).

January 22, 2011

40 Fascinating Blogs for the Ultimate Statistics Geek – Post

Filed under: Data Mining,Statistics — Patrick Durusau @ 1:29 pm

40 Fascinating Blogs for the Ultimate Statistics Geek

A varied collection of blogs on statistics.

Whether for data mining, modeling, or interpreting the data mining/modeling of others, you are going to need statistics.

Blogs are not a replacement for a good statistics book and a copy of Mathematica, but they are a place to start.

January 20, 2011

IMMM 2011: The First International Conference on Advances in Information Mining and Management

Filed under: Conferences,Data Mining,Information Retrieval,Searching — Patrick Durusau @ 7:40 pm

IMMM 2011: The First International Conference on Advances in Information Mining and Management.

July 17-22, 2011 – Bournemouth, UK

See the Call for Papers for details but general areas include:

  • Mining mechanisms and methods
  • Mining support
  • Type of information mining
  • Pervasive information retrieval
  • Automated retrieval and mining
  • Mining features
  • Information mining and management
  • Mining from specific sources
  • Data management in special environments
  • Mining evaluation
  • Mining tools and applications

Important deadlines:
Submission (full paper) March 1, 2011
Notification April 10, 2011
Registration April 25, 2011
Camera ready April 28, 2011

January 19, 2011

Scrapy

Filed under: Data Mining,Searching,Software — Patrick Durusau @ 1:34 pm

Scrapy

From the website:

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Another tool to assist with data gathering for topic map authoring.
