Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 25, 2012

Tesseract – Fast Multidimensional Filtering for Coordinated Views

Filed under: Analytics,Dataset,Filters,Multivariate Statistics,Visualization — Patrick Durusau @ 7:16 pm

Tesseract – Fast Multidimensional Filtering for Coordinated Views

From the post:

Tesseract is a JavaScript library for filtering large multivariate datasets in the browser. Tesseract supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records; we built it to power analytics for Square Register, allowing merchants to slice and dice their payment history fluidly.

Since most interactions only involve a single dimension, and then only small adjustments are made to the filter values, incremental filtering and reducing is significantly faster than starting from scratch. Tesseract uses sorted indexes (and a few bit-twiddling hacks) to make this possible, dramatically increasing the perfor­mance of live histograms and top-K lists. For more details on how Tesseract works, see the API reference.
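
This is not Tesseract's source, but the sorted-index idea is easy to see in a small Python sketch: sort a dimension once, then answer range filters with binary search instead of rescanning every record. The record layout and field names below are invented for illustration.

```python
import bisect
import random

# Hypothetical records: one "payment amount" dimension, as in a payment history.
records = [{"id": i, "amount": random.uniform(0, 500)} for i in range(100_000)]

# Build the sorted index once: (value, record id) pairs ordered by value.
index = sorted((r["amount"], r["id"]) for r in records)
values = [v for v, _ in index]

def filter_range(lo, hi):
    """Return ids of records with lo <= amount < hi via two binary searches."""
    start = bisect.bisect_left(values, lo)
    stop = bisect.bisect_left(values, hi)
    return [rid for _, rid in index[start:stop]]

# Adjusting a filter slightly only touches records near the old boundaries,
# which is why incremental updates can stay so fast.
selected = filter_range(10.0, 50.0)
print(len(selected), "records in range")
```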

Are you ready to “slice and dice” your data set?

March 22, 2012

Vista Stares Deep Into the Cosmos:…

Filed under: Astroinformatics,Data,Dataset — Patrick Durusau @ 7:42 pm

Vista Stares Deep Into the Cosmos: Treasure Trove of New Infrared Data Made Available to Astronomers

From the post:

The European Southern Observatory’s VISTA telescope has created the widest deep view of the sky ever made using infrared light. This new picture of an unremarkable patch of sky comes from the UltraVISTA survey and reveals more than 200 000 galaxies. It forms just one part of a huge collection of fully processed images from all the VISTA surveys that is now being made available by ESO to astronomers worldwide. UltraVISTA is a treasure trove that is being used to study distant galaxies in the early Universe as well as for many other science projects.

ESO’s VISTA telescope has been trained on the same patch of sky repeatedly to slowly accumulate the very dim light of the most distant galaxies. In total more than six thousand separate exposures with a total effective exposure time of 55 hours, taken through five different coloured filters, have been combined to create this picture. This image from the UltraVISTA survey is the deepest [1] infrared view of the sky of its size ever taken.

The VISTA telescope at ESO’s Paranal Observatory in Chile is the world’s largest survey telescope and the most powerful infrared survey telescope in existence. Since it started work in 2009 most of its observing time has been devoted to public surveys, some covering large parts of the southern skies and some more focused on small areas. The UltraVISTA survey has been devoted to the COSMOS field [2], an apparently almost empty patch of sky which has already been extensively studied using other telescopes, including the NASA/ESA Hubble Space Telescope [3]. UltraVISTA is the deepest of the six VISTA surveys by far and reveals the faintest objects.

Another six (6) terabytes of images, just in case you are curious.

And the rate of acquisition of astronomical data is only increasing.

Clever insights into how to more efficiently process and analyze the resulting data are surely welcome.

Einstein Archives Online

Filed under: Archives,Dataset — Patrick Durusau @ 7:41 pm

Einstein Archives Online

From the “about” page:

The Einstein Archives Online Website provides the first online access to Albert Einstein’s scientific and non-scientific manuscripts held by the Albert Einstein Archives at the Hebrew University of Jerusalem, constituting the material record of one of the most influential intellects in the modern era. It also enables access to the Einstein Archive Database, a comprehensive source of information on all items in the Albert Einstein Archives.

DIGITIZED MANUSCRIPTS

From 2003 to 2011, the site included approximately 3,000 high-quality digitized images of Einstein’s writings. This digitization of more than 900 documents written by Einstein was made possible by generous grants from the David and Fela Shapell family of Los Angeles. As of 2012, the site will enable free viewing and browsing of approximately 7,000 high-quality digitized images of Einstein’s writings. The digitization of close to 2,000 documents written by Einstein was produced by the Albert Einstein Archives Digitization Project and was made possible by the generous contribution of the Polonsky Foundation. The digitization project will continue throughout 2012.

FINDING AID

The site enables access to the online version of the Albert Einstein Archives Finding Aid, a comprehensive description of the entire repository of Albert Einstein’s personal papers held at the Hebrew University. The Finding Aid, presented in Encoded Archival Description (EAD) format, provides the following information on the Einstein Archives: its identity, context, content, structure, conditions of access and use. It also contains a list of the folders in the Archives which will enable access to the Archival Database and to the Digitized Manuscripts.

ARCHIVAL DATABASE

From 2003 to 2011, the Archival Database included approximately 43,000 records of Einstein and Einstein-related documents. Supplementary archival holdings and databases pertaining to Einstein documents have been established at both the Einstein Papers Project and the Albert Einstein Archives for scholarly research. As of 2012 the Archival Database allows direct access to all 80,000 records of Einstein and Einstein-related documents in the original and the supplementary archive. The records published in this online version pertain to Albert Einstein’s scientific and non-scientific writings, his professional and personal correspondence, notebooks, travel diaries, personal documents, and third-party items contained in both the original collection of Einstein’s personal papers and in the supplementary archive.

Unless you are a professional archivist, I suspect you will want to start with the Gallery. Which for some UI design reason appears at the bottom of the homepage in small type. (Hint: It really should be a logo at top left, to interest the average visitor.)

When you do reach mss. images, the zoom/navigation is quite responsive, although a slightly larger image to clue the reader in on location would be better. In fact, one that is readable and yet subject to zoom would be ideal.

Another improvement would be to display a URL to allow exchange of links to particular images, along with X/Y coordinates to the images. As presented, every reader has to re-find information in images for themselves.

Archiving material is good. Digital archives that enable wider access are better. Being able to reliably point into digital archives for commentary, comparison and other purposes is great.

March 17, 2012

NASA Releases Atlas Of Entire Sky

Filed under: Astroinformatics,Data Mining,Dataset — Patrick Durusau @ 8:19 pm

NASA Releases Atlas Of Entire Sky

J. Nicholas Hoover (InformationWeek) writes:

NASA this week released to the Web an atlas and catalog of 18,000 images consisting of more than 563 million stars, galaxies, asteroids, planets, and other objects in the sky–many of which have never been seen or identified before–along with data on all of those objects.

The space agency’s Wide-field Infrared Survey Explorer (WISE) mission, which was a collaboration of NASA’s Jet Propulsion Laboratory and the University of California Los Angeles, collected the data over the past two years, capturing more than 2.7 million images and processing more than 15 TB of astronomical data along the way. In order to make the data easier to use, NASA condensed the 2.7 million digital images down to 18,000 that cover the entire sky.

The WISE mission, which mapped the entire sky, uncovered a number of never-before-seen objects in the night sky, including an entirely new class of stars and the first “Trojan” asteroid that shares the Earth’s orbital path. The study also determined that there were far fewer mid-sized asteroids near Earth than had been previously thought. Even before the mass release of data to the Web, there have already been at least 100 papers published detailing the more limited results that NASA had already released.

Hoover also says that NASA has developed tutorials to assist developers in working with the data and that the entire database will be available in the not too distant future.

When I see releases like this one, I am reminded of Jim Gray (MS). Jim was reported to like astronomy data sets because they are big and free. See what you think about this one.

March 13, 2012

Common Crawl To Add New Data In Amazon Web Services Bucket

Filed under: Common Crawl,Dataset — Patrick Durusau @ 8:15 pm

Common Crawl To Add New Data In Amazon Web Services Bucket

From the post:

The Common Crawl Foundation is on the verge of adding to its Amazon Web Services (AWS) Public Data Set of openly and freely accessible web crawl data. It was back in January that Common Crawl announced the debut of its corpus on AWS (see our story here). Now, a billion new web sites are in the bucket, according to Common Crawl director Lisa Green, adding to the 5 billion web pages already there.

That’s good news!

At least I think so.

I am sure that, like everyone else, I will be trying to find the cycles (or at least thinking about it) to play with (sorry, explore) the Common Crawl data set.

I hesitate to say without reservation this is a good thing because my data needs are more modest than searching the entire WWW.

That wasn’t so hard to say. Hurt a little but not that much. 😉

I am exploring how to get better focus on information resources of interest to me. I rather doubt that focus is going to start with the entire WWW as an information space. Will keep you posted.

March 8, 2012

Twitter Current English Lexicon

Filed under: Dataset,Lexicon,Tweets — Patrick Durusau @ 8:50 pm

Twitter Current English Lexicon

From the description:

Twitter Current English Lexicon: Based on the Twitter Stratified Random Sample Corpus, we regularly extract the Twitter Current English Lexicon. Basically, we’re 1) pulling all tweets from the last three months of corpus entries that have been marked as “English” by the collection process (we have to make that call because there is no reliable means provided by Twitter), 2) removing all #hash, @at, and http items, 3) breaking the tweets into tokens, 4) building descriptive and summary statistics for all token-based 1-grams and 2-grams, and 5) pushing the top 10,000 N-grams from each set into a database and text files for review. So, for every top 1-gram and 2-gram, you know how many times it occurred in the corpus, and in how many tweets (plus associated percentages).

This is an interesting set of data, particularly when you compare it with a “regular” English corpus, something traditional like the Brown Corpus. Unlike most corpora, the top token (1-gram) for Twitter is “i” (as in me, myself, and I), there are a lot of intentional misspellings, and you find an undue amount of, shall we say, “callus” language (be forewarned). It’s a brave new world if you’re willing.

To use this data set, we recommend using the database version and KwicData, but you can also use the text version. Download the ZIP file you want, unzip it, then read the README file for more explanation about what’s included.
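
The extraction steps described above (strip the #hash, @at and http items, tokenize, then count 1-grams and 2-grams) are easy to prototype. A minimal sketch, with a couple of invented tweets standing in for the corpus:

```python
import re
from collections import Counter

tweets = [
    "i love this new phone http://t.co/abc",
    "@friend i love coffee #morning",
]

token_re = re.compile(r"[a-z']+")

def tokens(tweet):
    # Drop #hash, @at and http items, then lowercase and tokenize.
    cleaned = " ".join(w for w in tweet.split()
                       if not w.startswith(("#", "@", "http")))
    return token_re.findall(cleaned.lower())

unigrams, bigrams, tweet_counts = Counter(), Counter(), Counter()
for tweet in tweets:
    toks = tokens(tweet)
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))
    tweet_counts.update(set(toks))  # in how many tweets each token occurs

print(unigrams.most_common(10))
print(bigrams.most_common(10))
```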

I grabbed a copy yesterday but haven’t had the time to look at it.

Twitter feed pipeline software you would recommend?

February 28, 2012

StatLib

Filed under: Data,Dataset,Statistics — Patrick Durusau @ 8:41 pm

StatLib

From the webpage:

Welcome to StatLib, a system for distributing statistical software, datasets, and information by electronic mail, FTP and WWW. StatLib started out as an e-mail service and some of the organization still reflects that heritage. We hope that this document will give you sufficient guidance to navigate through the archives. For your convenience there are several sites around the world which serve as full or partial mirrors to StatLib.

An amazing source of software and data. Including sets of webpages for clustering analysis, etc.

Was mentioned in the first R-Podcast episode.

February 22, 2012

The Data Hub

Filed under: Data,Dataset — Patrick Durusau @ 4:48 pm

The Data Hub

From the about page:

What was the average price of a house in the UK in 1935? When will India’s projected population overtake that of China? Where can you see publicly-funded art in Seattle? Data to answer many, many questions like these is out there on the Internet somewhere – but it is not always easy to find.

the Data Hub is a community-run catalogue of useful sets of data on the Internet. You can collect links here to data from around the web for yourself and others to use, or search for data that others have collected. Depending on the type of data (and its conditions of use), the Data Hub may also be able to store a copy of the data or host it in a database, and provide some basic visualisation tools.

I covered the underlying software in CKAN – the Data Hub Software.

If your goal is to simply make data sets available with a minimal amount of metadata, this may be the software for you.

If your goal is to make data sets available with enough metadata to make robust use of them, you need to think again.

There is an impressive amount of data sets at this site.

But junk yards have an impressive number of wrecked cars.

Doesn’t help you find the car with the part you need. (Think data formats, semantics, etc.)

Eurostat

Filed under: Data,Dataset,Government Data,Statistics — Patrick Durusau @ 4:48 pm

Eurostat

From the “about” page:

Eurostat’s mission: to be the leading provider of high quality statistics on Europe.

Eurostat is the statistical office of the European Union situated in Luxembourg. Its task is to provide the European Union with statistics at European level that enable comparisons between countries and regions.

This is a key task. Democratic societies do not function properly without a solid basis of reliable and objective statistics. On one hand, decision-makers at EU level, in Member States, in local government and in business need statistics to make those decisions. On the other hand, the public and media need statistics for an accurate picture of contemporary society and to evaluate the performance of politicians and others. Of course, national statistics are still important for national purposes in Member States whereas EU statistics are essential for decisions and evaluation at European level.

Statistics can answer many questions. Is society heading in the direction promised by politicians? Is unemployment up or down? Are there more CO2 emissions compared to ten years ago? How many women go to work? How is your country’s economy performing compared to other EU Member States?

International statistics are a way of getting to know your neighbours in Member States and countries outside the EU. They are an important, objective and down-to-earth way of measuring how we all live.

I have seen Eurostat mentioned, usually negatively, by data aggregation services. I visited Eurostat today and found it quite useful.

For the non-data professional, there are graphs and other visualizations of popular data.

For the data professional, there are bulk downloads of data and other technical information.

I am sure there is room for improvement, but specific feedback is required to make that happen. (It has been my experience that positive, specific feedback works best. Find something nice to say and then suggest a change to improve the outcome.)

February 2, 2012

IMDb Alternative Interfaces

Filed under: Data,Dataset,IMDb — Patrick Durusau @ 3:39 pm

IMDb Alternative Interfaces.

From the webpage:

This page describes various alternate ways to access The Internet Movie Database locally by holding copies of the data directly on your system. See more about using our data on the Non-Commercial Licensing page.

It’s an interesting data set and I am sure its owners would not mind your sending them a screencast of some improved access you have created to their data.

That might actually be an interesting model for developing better interfaces to data served up to the public anyway. Release it for strictly personal use and see who does the best job with it. A screencast would not disclose any of your source code or processes, protecting the interest of the software author.

Just a thought.

First noticed this on PeteSearch.

February 1, 2012

One year on: 10 times bigger, masses more data… and a new API

Filed under: Corporate Data,Dataset,Government Data — Patrick Durusau @ 4:35 pm

One year on: 10 times bigger, masses more data… and a new API

From the post:

Was it just a year ago that we launched OpenCorporates, after just a couple months’ coding? When we opened up over 3 million companies and allowed searching across multiple jurisdictions (admittedly there were just three of them to start off with)?

Who would have thought that 12 months later we would have become 10 times bigger, with over 30 million companies and over 35 jurisdictions, and lots of other data too. So we could use this as an example to talk about some of the many milestones in that period, about all the extra data we’ve added, about our commitment to open data, and the principles behind it.

We’re not going to do that however, instead we’d rather talk about the new API we’ve just launched, allowing full access to all the info, and importantly allowing searches via the API too. In fact, we’ve now got a full website devoted to the api, http://api.opencorporates.com, and on it you’ll find all the documentation, example API calls, versioning information, error messages, etc.
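
A minimal sketch of a search call against the new API. The endpoint path and the response layout below are assumptions drawn from the API site of the time, so treat them as placeholders and check http://api.opencorporates.com for the current details.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint form; see api.opencorporates.com for the real documentation.
query = urllib.parse.quote("acme widgets")
url = "https://api.opencorporates.com/companies/search?q=" + query

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# The response is assumed to nest matches under results -> companies -> company.
for item in data.get("results", {}).get("companies", []):
    company = item.get("company", {})
    print(company.get("name"), company.get("jurisdiction_code"))
```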

Congratulations to OpenCorporates on a stellar year!

The collection of dots to connect has gotten dramatically larger!

January 15, 2012

Pbm: A new dataset for blog mining

Filed under: Blogs,Dataset — Patrick Durusau @ 9:15 pm

Pbm: A new dataset for blog mining by Mehwish Aziz and Muhammad Rafi.

Abstract:

Text mining is becoming vital as Web 2.0 offers collaborative content creation and sharing. Now Researchers have growing interest in text mining methods for discovering knowledge. Text mining researchers come from variety of areas like: Natural Language Processing, Computational Linguistic, Machine Learning, and Statistics. A typical text mining application involves preprocessing of text, stemming and lemmatization, tagging and annotation, deriving knowledge patterns, evaluating and interpreting the results. There are numerous approaches for performing text mining tasks, like: clustering, categorization, sentimental analysis, and summarization. There is a growing need to standardize the evaluation of these tasks. One major component of establishing standardization is to provide standard datasets for these tasks. Although there are various standard datasets available for traditional text mining tasks, but there are very few and expensive datasets for blog-mining task. Blogs, a new genre in web 2.0 is a digital diary of web user, which has chronological entries and contains a lot of useful knowledge, thus offers a lot of challenges and opportunities for text mining. In this paper, we report a new indigenous dataset for Pakistani Political Blogosphere. The paper describes the process of data collection, organization, and standardization. We have used this dataset for carrying out various text mining tasks for blogosphere, like: blog-search, political sentiments analysis and tracking, identification of influential blogger, and clustering of the blog-posts. We wish to offer this dataset free for others who aspire to pursue further in this domain.

This paper details construction of the blog data set used in Sentence based semantic similarity measure for blog-posts.

The aspect I found most interesting was the restriction of the data set to a particular domain. When I was using physical research tools (books) in libraries, there was no “index to everything” available. Nor would I have used it had it been available.

If I had a social science question (political science major) or later a law question (law school), I would pick a physical research tool (PRT) that was appropriate to the search request. Why? Because specialized publications were curated to facilitate research in a particular area, including identification of synonyms and cross-referencing of information you might otherwise not notice.

Is this blogging dataset a clue that, if we created sub-sets of the entire WWW, we could create indexing/analysis routines specific to those datasets? And hence give users a measurably better search experience?
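
For the clustering task the abstract mentions, here is a minimal sketch using scikit-learn rather than the authors' own method; the example posts are invented and stand in for entries from the Pbm dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical blog posts; in practice these would come from the blog corpus.
posts = [
    "election results announced for the provincial assembly",
    "new budget raises questions about energy subsidies",
    "cricket team wins the series after a tense final over",
    "opposition blogger comments on the assembly election",
]

# Bag-of-words with TF-IDF weighting, then k-means into two clusters.
vectors = TfidfVectorizer(stop_words="english").fit_transform(posts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for post, label in zip(posts, labels):
    print(label, post[:50])
```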

January 3, 2012

Iraq Body Count report: how many died and who was responsible?

Filed under: Dataset,News — Patrick Durusau @ 5:07 pm

Iraq Body Count report: how many died and who was responsible?

From the Guardian a very useful data set for a number of purposes. Particularly if paired with data on who was in the chain of command for various units.

It isn’t that hard to imagine a war crimes ticker for named individuals, linked to specific reports and acts. As well as more general responsibility for wars of aggression.

We will be waiting a long time for prosecutors who are dependent on particular countries for funding and support to step up and fully populate such a list with all responsible parties.

OpenData

Filed under: Dataset,Government Data — Patrick Durusau @ 5:06 pm

OpenData by Socrata

Another very large public data set collection.

Socrata developed the City of Chicago portal, which I mentioned at: Accessing Chicago, Cook and Illinois Open Data via PHP.

Mining Massive Data Sets – Update

Filed under: BigData,Data Analysis,Data Mining,Dataset — Patrick Durusau @ 5:03 pm

Mining Massive Data Sets by Anand Rajaraman and Jeff Ullman.

Update of Mining of Massive Datasets – eBook.

The hard copy has been published by Cambridge Press.

The electronic version remains available for download. (Hint: I suggest that all of us who can should buy a hard copy, to encourage this sort of publisher behavior.)

A homework system for both instructors and self-guided study is available at this page.

While I wait for a hard copy to arrive, I have downloaded the PDF version.

December 31, 2011

Weecology … has new mammal dataset

Filed under: Dataset — Patrick Durusau @ 7:21 pm

Weecology … has new mammal dataset

A post on using R with the Weecology data set.

From the post:

So the Weecology folks have published a large dataset on mammal communities in a data paper in Ecology. I know nothing about mammal communities, but that doesn’t mean one can’t play with the data…

Knowing nothing about a data set hasn’t deterred any number of think tanks and government organizations. It does, however, make useful merging more difficult. But if you really don’t know anything about the data set, then one merging is just as good as another. Perhaps you should go to work in one of the 2012 campaigns. 😉

On the other hand, you could learn about the data set and merge it usefully into other data sets for a particular area. Or for use in data-based planning of civic projects.

December 26, 2011

A Christmas Miracle

Filed under: Dataset,Government Data — Patrick Durusau @ 8:22 pm

A Christmas Miracle

From the post:

Data files on 407 banks, between the dates of 2007 to 2009, on the daily borrowing with the US Federal Reserve bank. The data sets are available from Bloomberg at this address data

This is an unprecedented look into the day-to-day transactions of banks with the Feds during one of the worse and unusual times in US financial history. A time of weekend deals, large banks being summoned to sign contracts, and all around chaos. For the economist, technocrat, and R enthusiasts this is the opportunity of a life time to examine and analyze financial data normally held in the strictest of confidentiality. A good comparison would be taking all of the auto companies and getting their daily production, sales, and cost data for two years and sending it out to the world. Never has happened.

Not to get too excited: what was released were daily totals, not the raw data itself.

Being a naturally curious person, when someone releases massaged data when the raw data would have been easier to release, I have to wonder: what would I see if I had the raw data? Or perhaps, in a topic maps context, what subjects could I link up with the raw data that I can’t with the massaged data?

December 22, 2011

Opening Up the Domesday Book

Filed under: Census Data,Dataset,Domesday Book,Geographic Data — Patrick Durusau @ 7:38 pm

Opening Up the Domesday Book by Sam Leon.

From the post:

Domesday Book might be one of the most famous government datasets ever created. Which makes it all the stranger that it’s not freely available online – at the National Archives, you have to pay £2 per page to download copies of the text.

Domesday is pretty much unique. It records the ownership of almost every acre of land in England in 1066 and 1086 – a feat not repeated in modern times. It records almost every household. It records the industrial resources of an entire nation, from castles to mills to oxen.

As an event, held in the traumatic aftermath of the Norman conquest, the Domesday inquest scarred itself deeply into the mindset of the nation – and one historian wrote that on his deathbed, William the Conqueror regretted the violence required to complete it. As a historical dataset, it is invaluable and fascinating.

In my spare time, I’ve been working on making Domesday Book available online at Open Domesday. In this, I’ve been greatly aided by the distinguished Domesday scholar Professor John Palmer, and his geocoded dataset of settlements and people in Domesday, created with AHRC funding in the 1990s.

I guess it really is all a matter of perspective. I have never thought of the Domesday Book as a “government dataset….” 😉

Certainly would make an interesting basis for a chronological topic map tracing the ownership and fate of “…almost every acre of land in England….”

December 14, 2011

IBM and Drug Companies Donate Data

Filed under: Cheminformatics,Dataset — Patrick Durusau @ 7:46 pm

IBM Contributes Data to the National Institutes of Health to Speed Drug Discovery and Cancer Research Innovation

From the post:

In collaboration with AstraZeneca, Bristol-Myers Squibb, DuPont and Pfizer, IBM is providing a database of more than 2.4 million chemical compounds extracted from about 4.7 million patents and 11 million biomedical journal abstracts from 1976 to 2000. The announcement was made at an IBM forum on U.S. economic competitiveness in the 21st century, exploring how private sector innovations and investment can be more easily shared in the public domain.

Excellent news and kudos to IBM and its partners for making the information available!

Now it is up to you to find creative ways to explore, connect up, analyze the data across other information sets.

My first question would be what was mentioned besides chemicals in the biomedical journal abstracts? Care to make an association to represent that relationship?

Why? Well, for example, if you are exposed to raw benzene, a by-product of oil refining, it can produce symptoms that are nearly identical to leukemia. Where would you encounter such a substance? Well, try living in Nicaragua for more than a decade, where the floors are cleaned every day with raw benzene. Of course, in the States, doctors don’t check for exposure to banned substances. Cases like that.

BTW, the data is already up, see: PubChem. Follow the links to the interface and click on “structures.” Not my area but the chemical structures are interesting enough that I may have to get a chemistry book for Christmas so I can have some understanding of what I am seeing.

That is probably the best part of being interested in semantic integration: it cuts across all fields, and new discoveries await with every turn of the page.

Creating a Live, Public Short Message Service Corpus: The NUS SMS Corpus

Filed under: Corpus Linguistics,Dataset — Patrick Durusau @ 11:00 am

Creating a Live, Public Short Message Service Corpus: The NUS SMS Corpus by Tao Chen, Min-Yen Kan.

Abstract:

Short Message Service (SMS) messages are largely sent directly from one person to another from their mobile phones. They represent a means of personal communication that is an important communicative artifact in our current digital era. As most existing studies have used private access to SMS corpora, comparative studies using the same raw SMS data has not been possible up to now. We describe our efforts to collect a public SMS corpus to address this problem. We use a battery of methodologies to collect the corpus, paying particular attention to privacy issues to address contributors’ concerns. Our live project collects new SMS message submissions, checks their quality and adds the valid messages, releasing the resultant corpus as XML and as SQL dumps, along with corpus statistics, every month. We opportunistically collect as much metadata about the messages and their sender as possible, so as to enable different types of analyses. To date, we have collected about 60,000 messages, focusing on English and Mandarin Chinese.

A unique and publicly available corpus of material.

Your average marketing company might not have an SMS corpus for you to work with but I can think of some other organizations that do. 😉 Train on this one to win your spurs.

December 11, 2011

tokenising the visible english text of common crawl

Filed under: Cloud Computing,Dataset,Natural Language Processing — Patrick Durusau @ 10:20 pm

tokenising the visible english text of common crawl by Mat Kelcey.

From the post:

Common crawl is a publicly available 30TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenise the visible text of the web pages in this dataset. All the code to do this is on github.
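
Mat's pipeline runs over Hadoop, but the core extract-and-tokenise step for a single page can be sketched in a few lines. This is not his code; it leans on BeautifulSoup rather than his actual parser.

```python
import re
from bs4 import BeautifulSoup

def visible_tokens(html):
    """Strip markup, scripts and styles, then return lowercase word tokens."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    return re.findall(r"[a-z']+", text.lower())

html = "<html><body><script>var x=1;</script><p>Hello, common crawl!</p></body></html>"
print(visible_tokens(html))  # ['hello', 'common', 'crawl']
```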

Well, 30TB of data, that certainly sounds like a small project. 😉

What small amount of data are you using for your next project?

November 30, 2011

More Google Cluster Data

Filed under: Clustering (servers),Dataset,Systems Research — Patrick Durusau @ 8:03 pm

More Google Cluster Data

From the post:

Google has a strong interest in promoting high quality systems research, and we believe that providing information about real-life workloads to the academic community can help.

In support of this we published a small (7-hour) sample of resource-usage information from a Google production cluster in 2010 (research blog on Google Cluster Data). Approximately a dozen researchers at UC Berkeley, CMU, Brown, NCSU, and elsewhere have made use of it.

Recently, we released a larger dataset. It covers a longer period of time (29 days) for a larger cell (about 11k machines) and includes significantly more information, including:

I remember Robert Barta describing the use of topic maps for systems administration. This data set could give some insight into the design of a topic map for cluster management.

What subjects and relationships would you recognize, how and why?

If you are looking for employment, this might be a good way to attract Google’s attention. (Hint to Google: Releasing interesting data sets could be a way to vet potential applicants in realistic situations.)

November 28, 2011

Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9

Filed under: Data Mining,Dataset,Extraction — Patrick Durusau @ 7:05 pm

Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9 by Ryan Rosario.

From the post:

Lately I have been doing a lot of work with the Wikipedia XML dump as a corpus. Wikipedia provides a wealth of information to researchers in easy-to-access formats including XML, SQL and HTML dumps for all language properties. Some of the data freely available from the Wikimedia Foundation include

  • article content and template pages
  • article content with revision history (huge files)
  • article content including user pages and talk pages
  • redirect graph
  • page-to-page link lists: redirects, categories, image links, page links, interwiki etc.
  • image metadata
  • site statistics

The above resources are available not only for Wikipedia, but for other Wikimedia Foundation projects such as Wiktionary, Wikibooks and Wikiquotes.

All of that is available, but it lacks any consistent usage of syntax. Ryan stumbles upon Wikipedia Extractor, which has pluses and minuses, an example of the latter being that it is really slow. Things look up for Ryan when he is reminded about Cloud9, which is designed for a MapReduce environment.

Read the post to see how things turned out for Ryan using Cloud9.
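
If you only need the raw wikitext rather than cleaned-up article text, the dump can also be streamed with the standard library, without loading it into memory. A minimal sketch; the dump file name is a placeholder and the element names follow the MediaWiki export schema.

```python
import xml.etree.ElementTree as ET

def iter_pages(dump_path):
    """Yield (title, wikitext) pairs from a MediaWiki XML dump, streaming."""
    title, text = None, None
    for event, elem in ET.iterparse(dump_path, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the export-schema namespace
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            elem.clear()                   # free memory as we go

for title, text in iter_pages("enwiki-latest-pages-articles.xml"):
    print(title, len(text))
    break
```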

Depending on your needs, Wikipedia URLs are a start on subject identifiers, although you will probably need to create some for your particular domain.

November 20, 2011

Download network graph data sets from Konect – the Koblenz network collection

Filed under: Dataset,Graphs — Patrick Durusau @ 4:12 pm

Download network graph data sets from Konect – the Koblenz network collection

From the post:

One of the first things I did @ my Institute when starting my PhD program was reading the PhD thesis of Jérôme Kunegis. For a mathematician, a nice piece of work to read. For his thesis he analyzed the evolution of networks. Over the last years Jérôme has collected several (119!) data sets with network graphs. All have different properties.

He provides the data sets and some basic statistics @ http://konect.uni-koblenz.de/networks

Sometimes edges are directed, sometimes they have timestamps, sometimes even content. Some graphs are bipartite and the graphs come from different application domains such as trust, social networks, web graphs, co-citation, semantic, features, ratings and communications…

I was browsing René Pickhardt’s blog when I ran across this entry. Pure gold.

November 12, 2011

Mining Lending Club’s Goldmine of Loan Data Part I of II…

Filed under: Dataset,R,Visualization — Patrick Durusau @ 8:43 pm

Mining Lending Club’s Goldmine of Loan Data Part I of II – Visualizations by State by Tanya Cashorali.

Very cool post that combines using R with analysis of a financial data set, plus visualization by state in the United States.
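
This is not Tanya's code, but the group-by-state step translates directly into a few lines of pandas; the column names below are assumptions about the Lending Club extract.

```python
import pandas as pd

# Hypothetical extract of the loan CSV; the real column names may differ.
loans = pd.DataFrame({
    "addr_state": ["CA", "CA", "NY", "TX", "NY"],
    "loan_amnt":  [5000, 12000, 8000, 3000, 15000],
})

by_state = (loans.groupby("addr_state")["loan_amnt"]
                 .agg(["count", "mean", "sum"])
                 .sort_values("sum", ascending=False))
print(by_state)

# by_state.plot(kind="bar", y="sum") would give a rough per-state visualization.
```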

Of course the data has uniform semantics, so it really doesn’t present the issues that topic maps normally deal with. Or does it?

What if instead of loan data I had campaign contributions and the promised (but not delivered so far as I know) federal contract database? Which no doubt will have very different terminology as well as shadows and shell companies to conceal interested parties.

Developing your skills with R and visualization of mono-semantic data sets will stand you in good stead when you encounter more complex cases.

November 10, 2011

Google1000 dataset

Filed under: Dataset,Image Recognition,Machine Learning — Patrick Durusau @ 6:46 pm

Google1000 dataset

From the post:

This is a dataset of scans of 1000 public domain books that was released to the public at ICDAR 2007. At the time there was no public serving infrastructure, so few people actually got the 120GB dataset. It has since been hosted on Google Cloud Storage and made available for public download: (see the post for the links)

Intended for OCR and machine learning purposes. The results of which you may wish to unite in topic maps with other resources.

November 9, 2011

How Common Is Merging?

Filed under: Dataset,Merging,Topic Map Software,Topic Maps — Patrick Durusau @ 7:44 pm

I started wondering about how common merging is in topic maps because I noticed a gap I had not seen before: there aren’t any large test collections of topic maps for CS types to break their clusters against. The sort of thing that challenges their algorithms and hardware.

But test collections should have some resemblance to actual data sets, at least if that is known with any degree of certainty. Or at least be one of the available data sets.

As a first step towards exploring this issue, I grepped for topics in the Opera and CIA Fact Book and got:

  • Opera topic map: 29,738
  • CIA Fact Book: 111,154

for a total of 140,892 topic elements. After merging the two maps, there were 126,204 topic elements. So I count that as merging 14,688 topic elements.

Approximately 10% of the topics in the two sets.

A very crude way to go about this but I was looking for rough numbers that may provoke some discussion and more refined measurements.
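
For reference, the crude count above can be reproduced with a few lines of Python. The file names are hypothetical; it simply counts <topic> start tags in the XTM files before and after merging.

```python
def count_topics(path):
    """Count <topic ...> start tags in an XTM file; crude but quick."""
    with open(path, encoding="utf-8") as f:
        return f.read().count("<topic ")

opera = count_topics("opera.xtm")        # hypothetical file names
cia = count_topics("cia-factbook.xtm")
merged = count_topics("merged.xtm")      # the two maps after merging

print("before merge:", opera + cia)
print("after merge: ", merged)
print("topics merged away:", (opera + cia) - merged)
```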

I mention that because one thought I had was to simply “cat” the various topic maps at the topicmapslab.de in CTM format together into one file and to “cat” that file until I have 1 million, 10 million and 100 million topic sets (approximately). Just a starter set to see what works/doesn’t work before scaling up the data sets.

Creating the files in this manner is going to result in a “merge heavy” topic map due to the duplication of content. That may not be a serious issue and perhaps better that it be that way in order to stress algorithms, etc. It would have the advantage that we could merge the original set and then project the number of merges that should be found in the various sets.

Suggestions/comments?

November 4, 2011

More Data: Tweets & News Articles

Filed under: Dataset,News,TREC,Tweets — Patrick Durusau @ 6:07 pm

From Max Lin’s blog, Ian Soboroff posted:

Two new collections being released from TREC today:

The first is the long-awaited Tweets2011 collection. This is 16 million tweets sampled by Twitter for use in the TREC 2011 microblog track. We distribute the tweet identifiers and a crawler, and you download the actual tweets using the crawler. http://trec.nist.gov/data/tweets/

The second is TRC2, a collection of 1.8 million news articles from Thomson Reuters used in the TREC 2010 blog track. http://trec.nist.gov/data/reuters/reuters.html

Both collections are available under extremely permissive usage agreements that limit their use to research and forbid redistribution, but otherwise are very open as data usage agreements go.

It may just be my memory but I don’t recall seeing topic map research with the older Reuters data set (the new one is too recent). Is that true?

Anyway, more large data sets for your research pleasure.

November 1, 2011

Facebook100 data and a parser for it

Filed under: Data,Dataset — Patrick Durusau @ 3:33 pm

Facebook100 data and a parser for it

From the post:

A few weeks ago, Mason Porter posted a goldmine of data, the Facebook100 dataset. The dataset contains all of the Facebook friendships at 100 US universities at some time in 2005, as well as a number of node attributes such as dorm, gender, graduation year, and academic major. The data was apparently provided directly by Facebook.

As far as I know, the dataset is unprecedented and has the potential to advance both network methods and insights into the structure of acquaintanceship. Unfortunately, the Facebook Data Team requested that Porter no longer distribute the dataset. It does not include the names of individuals or even any of the node attributes (they have been given integer ids), but Facebook seems to be concerned. Anonymized network data is, after all, vulnerable to de-anonymization (for some nice examples of why, see the last 20 minutes of this video lecture from Jon Kleinberg).

It’s a shame that Porter can no longer distribute the data. On the other hand, once a dataset like that has been released, will the internet be able to forget it? After a bit of poking around I found the dataset as a torrent file. In fact, if anyone is seeding the torrent, you can download it by following this link and it appears to be on rapidshare.

Can anyone confirm a location for the Facebook100 data? I get “file removed” from the brave folks at rapidshare and ads to register for various download services (before knowing the file is available) from the torrent site. Thanks!
