Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 31, 2011

Webdam Project: Foundations of Web Data Management

Filed under: Data,Data Management,Web Applications,XML — Patrick Durusau @ 7:28 pm

Webdam Project: Foundations of Web Data Management

From the homepage:

The goal of the Webdam project is to develop a formal model for Web data management. This model will open new horizons for the development of the Web in a well-principled way, enhancing its functionality, performance, and reliability. Specifically, the goal is to develop a universally accepted formal framework for describing complex and flexible interacting Web applications featuring notably data exchange, sharing, integration, querying and updating. We also propose to develop formal foundations that will enable peers to concurrently reason about global data management activities, cooperate in solving specific tasks and support services with desired quality of service. Although the proposal addresses fundamental issues, its goal is to serve as the basis for future software development for Web data management.

Books from the project:

  • Foundations of Databases, Serge Abiteboul, Rick Hull, Victor Vianu, open access online edition
  • Web Data Management and Distribution, Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart, open access online edition
  • Modeling, Querying and Mining Uncertain XML Data, Evgeny Kharlamov and Pierre Senellart. In A. Tagarelli, editor, XML Data Mining: Models, Methods, and Applications. IGI Global, 2011. Open access online edition

I discovered this project via a link to “Web Data Management and Distribution” in Christophe Lalanne’s A bag of tweets / Dec 2011, which pointed to the PDF file, some 400 pages. I went looking for the HTML page behind that link and discovered the project along with these titles.

There are a number of other publications associated with the project that you may find useful. “Querying and Mining Uncertain XML” is only a chapter out of a larger publication by IGI Global. About what one expects from IGI Global. Cambridge University Press published the title just preceding this chapter and allows download of the entire book for personal use.

I think there is a lot to be learned from this project, even if it has not resulted in a universal framework for web applications that exchange data. I don’t think we are in any danger of universal frameworks on or off the web. And we are better for it.

December 29, 2011

Web scraping with Python – the dark side of data

Filed under: Data,Data Mining,Python,Web Scrapers — Patrick Durusau @ 9:14 pm

Web scraping with Python – the dark side of data

From the post:

In searching for some information on web-scrapers, I found a great presentation given at Pycon in 2010 by Asheesh Laroia. I thought this might be a valuable resource for R users who are looking for ways to gather data from user-unfriendly websites.

“…user-unfriendly websites”? What about “user-hostile websites”? 😉

Looks like a good presentation up to “user-unfriendly.”

It will be useful for anyone who needs data from sites that are not configured to deliver it properly (that is, to users).
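As a taste of what is involved, here is a minimal Python sketch of that kind of scraping: fetch a page and pull the data out of its HTML. (The URL and the choice of <td> cells are placeholders of mine, not anything from the presentation.)

  import urllib.request
  from html.parser import HTMLParser

  class CellCollector(HTMLParser):
      # Collects the text of every <td> cell on the page.
      def __init__(self):
          super().__init__()
          self.in_cell = False
          self.cells = []
      def handle_starttag(self, tag, attrs):
          if tag == "td":
              self.in_cell = True
      def handle_endtag(self, tag):
          if tag == "td":
              self.in_cell = False
      def handle_data(self, data):
          if self.in_cell and data.strip():
              self.cells.append(data.strip())

  html = urllib.request.urlopen("http://example.com/table.html").read().decode()
  parser = CellCollector()
  parser.feed(html)
  print(parser.cells)

Real sites are messier, which is rather the point of the presentation.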

I suppose “user-hostile” would fall under some prohibited activity.

Would make a great title for a book: “Penetration and Mapping of Hostile Hosts.” Could map vulnerable hosts with their exploits as a network graph.

December 18, 2011

The best way to get value from data is to give it away

Filed under: Data,Marketing — Patrick Durusau @ 8:49 pm

The best way to get value from data is to give it away, from the Guardian.

From the article:

Last Friday I wrote a short piece for the Datablog giving some background and context for a big open data policy package that was announced yesterday morning by Vice President Neelie Kroes. But what does the package contain? And what might the new measures mean for the future of open data in Europe?

The announcement contained some very strong language in support of open data. Open data is the new gold, the fertile soil out of which a new generation of applications and services will grow. In a networked age, we all depend on data, and opening it up is the best way to realise its value, to maximise its potential.

There was little ambiguity about the Commissioner’s support for an ‘open by default’ position for public sector information, nor for her support for the open data movement, for “those of us who believe that the best way to get value from data is to give it away“. There were props to Web Inventor Tim Berners-Lee, the Open Knowledge Foundation, OpenSpending, WheelMap, and the Guardian Datablog, amongst others.

Open government data at no or low cost represents a real opportunity for value-add data vendors. Particularly those using topic maps.

Topic maps enable the creation of data products that can be easily integrated with data products created from different perspectives.

Not to mention reuse of data analysis to create new products to respond to public demand.

For example, after the recent misfortunes with flooding and nuclear reactors in Japan, there was an upsurge of interest in the safety of reactors in other countries. The information provided by news outlets was equal parts summary and reassurance. A data product that mapped together known issues with the plants in Japan, their design, inspection reports on reactors in some locale, plus maps of their locations, etc., would have found a ready audience.

Creation of a data product like that, in time to catch the increase in public interest, would depend on prior analysis of large amounts of public data. Analysis that could be re-used for a variety of purposes.

December 17, 2011

Data, Best Used By…

Filed under: Data — Patrick Durusau @ 7:51 pm

Data, Best Used By:

From the post:

To state the obvious, “Big Data” is big. The deluge of data has people talking about the volume of data, which is understandable, but not as much attention has been paid to how the value of data can age. Value is often not just about volume. It can also be thought of as perishable.

The same can be said of data in a topic map. Note the emphasis on can. Whether your data is perishable or not, depends both on the data and on your requirements.

Historical trend data, for example, isn’t so much perishable as it may be incomplete (if not kept current).

December 13, 2011

Tiered Storage Approaches to Big Data:…

Filed under: Archives,Data,Storage — Patrick Durusau @ 9:47 pm

Tiered Storage Approaches to Big Data: Why look to the Cloud when you’re working with Galaxies?

Event Date: 12/15/2011 02:00 PM Eastern Standard Time

From the email:

The ability for organizations to keep up with the growth of Big Data in industries like satellite imagery, genomics, oil and gas, and media and entertainment has strained many storage environments. Though storage device costs continue to be driven down, corporations and research institutions have to look to setting up tiered storage environments to deal with increasing power and cooling costs and shrinking data center footprint of storing all this big data.

NASA’s Earth Observing System Data and Information System (EOSDIS) is arguably a poster child when looking at large image file ingest and archive. Responsible for processing, archiving, and distributing Earth science satellite data (e.g., land, ocean and atmosphere data products), NASA EOSDIS handles hundreds of millions of satellite image data files averaging roughly 7 MB to 40 MB in size and totaling over 3 PB of data.

Discover, from a panel of three experts, long-term data tiering, archival, and data protection strategies for handling large files using products like Quantum’s StorNext data management solution and similar offerings. Hear how NASA EOSDIS handles its data workflow and long-term archival across four sites in North America and makes this data freely available to scientists.

Think of this as a starting point to learn some of the “lingo” in this area and perhaps hear some good stories about data and NASA.

Some questions to think about during the presentation/discussion:

How do you effectively access information after not only the terminology but the world view of a discipline has changed?

What do you have to know about the data and its storage?

How do the products discussed address those questions?

November 30, 2011

No Datum is an Island of Serendip

Filed under: Data,Serendipity — Patrick Durusau @ 8:06 pm

No Datum is an Island of Serendip by Jim Harris.

From the post:

Continuing a series of blog posts inspired by the highly recommended book Where Good Ideas Come From by Steven Johnson, in this blog post I want to discuss the important role that serendipity plays in data — and, by extension, business success.

Let’s start with a brief etymology lesson. The origin of the word serendipity, which is commonly defined as a “happy accident” or “pleasant surprise” can be traced to the Persian fairy tale The Three Princes of Serendip, whose heroes were always making discoveries of things they were not in quest of either by accident or by sagacity (i.e., the ability to link together apparently innocuous facts to come to a valuable conclusion). Serendip was an old name for the island nation now known as Sri Lanka.

“Serendipity,” Johnson explained, “is not just about embracing random encounters for the sheer exhilaration of it. Serendipity is built out of happy accidents, to be sure, but what makes them happy is the fact that the discovery you’ve made is meaningful to you. It completes a hunch, or opens up a door in the adjacent possible that you had overlooked. Serendipitous discoveries often involve exchanges across traditional disciplines. Serendipity needs unlikely collisions and discoveries, but it also needs something to anchor those discoveries. The challenge, of course, is how to create environments that foster these serendipitous connections.”

I don’t disagree about the importance of serendipity but I do wonder about the degree to which we can plan or even facilitate it. At least in terms of software/interfaces, etc.

Remember Malcolm Gladwell and The Tipping Point? It’s a great read but there is one difficulty that I don’t think Malcolm dwells on enough. It is one thing to pick out tipping points (or alleged ones) in retrospect. It is quite another to pick out a tipping point before it occurs and to plan to take advantage of it. There are any number of rationalist explanations for various successes, but they are all after-the-fact constructs that serve particular purposes.

I do think we can make serendipity more likely by exposing people to a variety of information that makes the realization of connections between information more likely. That isn’t to say that serendipity will happen, just that we can create circumstances for people that will make the conditions ripe for it.

November 24, 2011

FactLab

Filed under: Data,Data Source,Interface Research/Design — Patrick Durusau @ 3:55 pm

FactLab

From the webpage:

Factlab collects official stats from around the world, bringing together the World Bank, UN, the EU and the US Census Bureau. How does it work for you – and what can you do with the data?

From the Guardian in the UK.

Very impressive and interactive site.

I don’t agree with their philosophical assumptions about “facts,” but nonetheless a number of potential clients do. So long as they are paying the freight, facts they are. 😉

Weather forecast and good development practices

Filed under: Data,Data Management — Patrick Durusau @ 3:54 pm

Weather forecast and good development practices by Paolo Sonego.

From the post:

Inspired by this tutorial, I thought it would be nice to have access to weather forecasts directly from the R command line, for example for a personalized start-up message such as the one below:

Weather summary for Trieste, Friuli-Venezia Giulia:
The weather in Trieste is clear. The temperature is currently 14°C (57°F). Humidity: 63%.

Fortunately, thanks to Duncan Temple Lang’s always useful XML package (see here for a tutorial about XML programming under R), it is straightforward to write a few lines of R code to invoke the Google weather API for the location of interest, retrieve the XML file, parse it using the XPath paradigm and get the required information:
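The R code itself is in Paolo’s post. As a rough Python analogue of the same fetch-parse-XPath pattern (the legacy Google weather endpoint and its element names are my assumptions from that era, and the API has since been shut down, so substitute any XML weather feed):

  import urllib.parse
  import urllib.request
  import xml.etree.ElementTree as ET

  def weather_summary(city):
      # Legacy endpoint; it returned <current_conditions> with data attributes.
      url = "http://www.google.com/ig/api?weather=" + urllib.parse.quote(city)
      with urllib.request.urlopen(url) as resp:
          tree = ET.parse(resp)
      cur = tree.find(".//current_conditions")
      condition = cur.find("condition").get("data")
      temp_c = cur.find("temp_c").get("data")
      humidity = cur.find("humidity").get("data")  # e.g. "Humidity: 63%"
      return ("The weather in %s is %s. The temperature is currently %s°C. %s"
              % (city, condition.lower(), temp_c, humidity))

  print(weather_summary("Trieste"))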

You may need weather information for your topic map, but more importantly, it will be useful if small routines or libraries are written for common data sets. There is little reason for multiple libraries for, say, census data, unless the data is substantially different.

November 20, 2011

On Data and Jargon

Filed under: Crowd Sourcing,Data,Jargon — Patrick Durusau @ 4:19 pm

On Data and Jargon by Phil Simon.

From the post:

I was recently viewing an online presentation from my friend Scott Berkun. In it, Scott uses austere slides like the one with this simple bromide:

Whoever uses the most jargon has the least confidence in their ideas.

I really like that.

Are we hiding from crowds behind our jargon?

If yes, why? What do we have to lose? What do we have to gain by not hiding?

November 10, 2011

Putting Data in the Middle

Filed under: Data,Interoperability — Patrick Durusau @ 6:45 pm

Putting Data in the Middle

Jill Dyche uses a photo of Paul Allen and Bill Gates as a jumping off point to talk about a data-centric view of the world.

Remarking:

IT departments furtively investing in successive integration efforts, hoping for the latest and greatest “single version of the truth” watch their budgets erode and their stakeholders flee. CIOs praying that their latest packaged application gets traction realize that they’ve just installed yet another legacy system. Executives wake up and admit that the idea of a huge, centralized, behemoth database accessible by all and serving a range of business needs was simply a dream. Rubbing their eyes they gradually see that data is decoupled from the systems that generate and use it, and past infrastructure plays have merely sedated them.

I really like the “successive integration efforts” line.

Jill offers an alternative to that sad scenario, but you will have to read her post to find out!

November 1, 2011

Facebook100 data and a parser for it

Filed under: Data,Dataset — Patrick Durusau @ 3:33 pm

Facebook100 data and a parser for it

From the post:

A few weeks ago, Mason Porter posted a goldmine of data, the Facebook100 dataset. The dataset contains all of the Facebook friendships at 100 US universities at some time in 2005, as well as a number of node attributes such as dorm, gender, graduation year, and academic major. The data was apparently provided directly by Facebook.

As far as I know, the dataset is unprecedented and has the potential to advance both network methods and insights into the structure of acquaintanceship. Unfortunately, the Facebook Data Team requested that Porter no longer distribute the dataset. It does not include the names of individuals or even of any of the node attributes (they have been given integer ids), but Facebook seems to be concerned. Anonymized network data is after all vulnerable to de-anonymization (for some nice examples of why, see the last 20 minutes of this video lecture from Jon Kleinberg).

It’s a shame that Porter can no longer distribute the data. On the other hand, once a dataset like that has been released, will the internet be able to forget it? After a bit of poking around I found the dataset as a torrent file. In fact, if anyone is seeding the torrent, you can download it by following this link and it appears to be on rapidshare.

Can anyone confirm a location for the Facebook100 data? I get “file removed” from the brave folks at rapidshare and ads to register for various download services (before knowing the file is available) from the torrent site. Thanks!
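If you do track down a copy: the data ships as one Matlab .mat file per school, and a hedged Python sketch for reading one looks like this (the “A” and “local_info” field names follow the dataset’s published description; verify them against your copy):

  from scipy.io import loadmat

  data = loadmat("Caltech36.mat")
  A = data["A"]              # sparse friendship adjacency matrix
  info = data["local_info"]  # per-node attributes as anonymized integer ids:
                             # status, gender, major, second major, dorm,
                             # year, high school
  print(A.shape, info.shape)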

October 31, 2011

Using StackOverflow’s API to Find the Top Web Frameworks

Filed under: Data,Data Analysis,Searching,Visualization — Patrick Durusau @ 7:32 pm

Using StackOverflow’s API to Find the Top Web Frameworks by Bryce Boe.

From the post:

Adam and I are currently in the process of working on our research about the Execution After Redirect, or EAR, Vulnerability which I previously discussed in my blog post about the 2010 iCTF. While Adam is working on a static analyzer to detect EARs in ruby on rails projects, I am testing how simple it is for a developer to introduce an EAR vulnerability in several popular web frameworks. In order to do that, I first needed to come up with a mostly unbiased list of popular web frameworks.

My first thought was to perform a search on the top web frameworks hoping that the information I seek may already be available. This search provided a few interesting results, such as the site Best Web-Frameworks as well as the page Framework Usage Statistics by the group BuiltWith. The Best Web-Frameworks page lists and compares various web frameworks by language; however, it offers no means to compare the adoption of each. The Framework Usage Statistics page caught my eye as its usage statistics are generated by crawling and fingerprinting various websites in order to determine what frameworks are in use. Their fingerprinting technique, however, is too generic in some cases, thus resulting in the labeling of languages like php and perl as frameworks. While these results were a step in the right direction, what I was really hoping to find was a list of top web frameworks that follow the model, view, controller, or MVC, architecture.

After a bit more consideration I realized it wouldn’t be very simple to get a list of frameworks by usage, thus I had to consider alternative metrics. I thought I could measure the popularity of a framework by the number of developers using, or at least interested in, the framework. It was this train of thought that led me to both Google Trends and StackOverflow. Google Trends allows one to perform a direct comparison of various search queries over time, such as ruby on rails compared to python. The problem, as evidenced by the former link, is that some of the search queries don’t directly apply to the web framework; in this case not all the people searching for django are looking for the web framework. Because of this problem, I decided a more direct approach was needed.

StackOverflow is a website geared towards developers where they can go to ask questions about various programing languages, development environments, algorithms, and, yes, even web frameworks. When someone asks a question, they can add tags to the question to help guide it to the right community. Thus if I had a question about redirects in ruby on rails, I might add the tag ruby-on-rails. Furthermore if I was interested in questions other people had about ruby on rails I might follow the ruby-on-rails tag.

Bryce’s use of StackOverflow’s API is likely to interest anyone creating topic maps on CS topics. Not to mention that his use of graphs for visualization is interesting as well.
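The tag-count lookup itself is only a few lines. Here is a minimal Python sketch against today’s Stack Exchange API (v2.x, which postdates Bryce’s post, so the endpoint details are my assumptions, not his code):

  import gzip
  import json
  import urllib.parse
  import urllib.request

  def tag_count(tag):
      # Question count for one tag; the API gzips its responses.
      url = ("https://api.stackexchange.com/2.3/tags/%s/info?site=stackoverflow"
             % urllib.parse.quote(tag))
      with urllib.request.urlopen(url) as resp:
          raw = resp.read()
          if resp.headers.get("Content-Encoding") == "gzip":
              raw = gzip.decompress(raw)
      return json.loads(raw)["items"][0]["count"]

  for tag in ["ruby-on-rails", "django", "asp.net-mvc"]:
      print(tag, tag_count(tag))

Sorting those counts gives a rough popularity ranking of the frameworks’ developer communities.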

October 30, 2011

Data: Making a List of the Top 300 Blogs about Data, Who Did We Miss?

Filed under: Data,Dataset — Patrick Durusau @ 7:04 pm

Data: Making a List of the Top 300 Blogs about Data, Who Did We Miss? by Marshall Kirkpatrick.

From the post:

Dear friends and neighbors, as part of my ongoing practice of using robots and algorithms to make grandiose claims about topics I know too little about, I have enlisted a small army of said implements of journalistic danger to assemble the above collection of blogs about data. I used a variety of methods to build the first half of the list, then scraped all the suggestions from this Quora discussion to flesh out the second half. Want to see if your blog is on this list? Control-F and search for its name or URL and your browser will find it if it’s there.

Why data? Because we live in a time when the amount of data being produced is exploding and it presents incredible opportunities for software developers and data analysts. Opportunities to build new products and services, but also to discover patterns. Those patterns will represent further opportunities for innovation, or they’ll illuminate injustices, or they’ll simply delight us with a greater sense of self-awareness than we had before. (I was honored to have some of my thoughts on data as a platform cited in this recent Slate write-up on the topic, if you’re interested in a broader discussion.) Data is good, and these are the leading people I’ve found online who are blogging about it.

A bit dated now but instructive for the process of mining and then ranking the blogs. There are any number of subject areas that await similar treatment.

  1. What subject area would interest you enough to collect the top 100 or 300 blogs?
  2. Would collecting and ranking be enough to be useful? For what purposes? Where would that fail?
  3. How would you envision topic maps making a difference for such a collection of blogs?

October 28, 2011

Strata Conference: Making Data Work

Filed under: Conferences,Data,Data Mining,Data Science — Patrick Durusau @ 3:15 pm

Strata Conference: Making Data Work Proceedings from the New York Strata Conference, Sept. 22-23, 2011.

OK, so you missed the live video feeds. Don’t despair, videos are available for some and slides appear to be available for all. Not like being there or seeing the videos but better than missing it altogether!

A number of quite remarkable presentations.

Dealing with Data (Science 11 Feb. 2011)

Filed under: Data,Data Mining,Data Science — Patrick Durusau @ 3:11 pm

Dealing with Data (Science 11 Feb. 2011)

From the website:

In the 11 February 2011 issue, Science joins with colleagues from Science Signaling, Science Translational Medicine, and Science Careers to provide a broad look at the issues surrounding the increasingly huge influx of research data. This collection of articles highlights both the challenges posed by the data deluge and the opportunities that can be realized if we can better organize and access the data.

The Science cover (left) features a word cloud generated from all of the content from the magazine’s special section.

Science is making access to this entire collection FREE (simple registration is required for non-subscribers).

Better late than never!

This is a very good overview of the big data issue, from a science perspective.

October 27, 2011

Data.gov

Filed under: Data,Data Mining,Government Data — Patrick Durusau @ 4:46 pm

Data.gov

A truly remarkable range of resources from the U.S. Federal Government, made all the more interesting by Data.gov Next Generation:

Data.gov starts an exciting new chapter in its evolution to make government data more accessible and usable than ever before. The data catalog website that broke new ground just two years ago is once again redefining the Open Data experience. Learn more about Data.gov’s transformation into a cloud-based Open Data platform for citizens, developers and government agencies in this 4-minute introductory video.

Developers should take a look at: http://dev.socrata.com/.

October 6, 2011

Data Without Borders

Filed under: Data,Non-Profit,Volunteer — Patrick Durusau @ 5:35 pm

Data Without Borders

From the blog site:

Data Without Borders seeks to match non-profits in need of data analysis with freelance and pro bono data scientists who can work to help them with data collection, analysis, visualization, or decision support.

Would not be a bad place to show off your topic maps skills and the results of using them!

October 5, 2011

On Understanding Data Abstraction, Revisited

Filed under: Data,Semantics — Patrick Durusau @ 6:56 pm

On Understanding Data Abstraction, Revisited by William R. Cook.

Abstract:

In 1985 Luca Cardelli and Peter Wegner, my advisor, published an ACM Computing Surveys paper called “On understanding types, data abstraction, and polymorphism”. Their work kicked off a flood of research on semantics and type theory for object-oriented programming, which continues to this day. Despite 25 years of research, there is still widespread confusion about the two forms of data abstraction, abstract data types and objects. This essay attempts to explain the differences and also why the differences matter.

With all the talk about data types, this is worth re-reading.
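Cook’s running example is integer sets. Here is a toy Python gloss of the contrast, my paraphrase and not code from the essay: an abstract data type hides a single representation behind its operations, while an object is characterized entirely by the interface it answers to.

  # ADT-style: one hidden representation; union may inspect the other's rep.
  class IntSetADT:
      def __init__(self, items=()):
          self._rep = frozenset(items)
      def union(self, other):
          return IntSetADT(self._rep | other._rep)
      def contains(self, x):
          return x in self._rep

  # Object-style: a set is just its membership test; union only calls the
  # interface and never sees a representation.
  def union(s, t):
      return lambda x: s(x) or t(x)

  evens = lambda x: x % 2 == 0
  small = lambda x: 0 <= x < 10
  both = union(evens, small)
  print(both(3), both(12))  # True True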

September 24, 2011

What are the best blogs about data? Why?

Filed under: Data,Data Science — Patrick Durusau @ 6:58 pm

What are the best blogs about data? Why?

A very extensive listing (as you can imagine) of blogs about data.

Quora reports:

This question has been viewed 15577 times; it has 2 monitors with 21188 topic followers and 0 aliases exist.

831 people are following this question.

So the question is popular.

How would you make the answer more useful?

September 22, 2011

Visualizing Lexical Novelty in Literature

Filed under: Data,Visualization — Patrick Durusau @ 6:19 pm

Visualizing Lexical Novelty in Literature by Matthew Hurst.

From the post:

Novels are full of new characters, new locations and new expressions. The discourse between characters involves new ideas being exchanged. We can get a hint of this by tracking the introduction of new terms in a novel. In the below visualizations (in which each column represents a chapter and each small block a paragraph of text), I maintain a variable which represents novelty. When a paragraph contains more than 25% new terms (i.e. words that have not been observed thus far) this variable is set at its maximum of 1. Otherwise, the variable decays. The variable is used to colour the paragraph with red being 1.0 and blue being 0. The result is that we can get an idea of the introduction of new ideas in novels.
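The mechanics are simple enough to sketch in a few lines of Python (the decay rate below is my assumption; the post does not give one):

  def novelty_scores(paragraphs, threshold=0.25, decay=0.9):
      seen = set()
      novelty = 0.0
      scores = []
      for para in paragraphs:
          words = para.lower().split()
          new = [w for w in words if w not in seen]
          if words and len(new) / len(words) > threshold:
              novelty = 1.0      # spike: mostly unseen vocabulary
          else:
              novelty *= decay   # otherwise fade toward 0
          seen.update(words)
          scores.append(novelty) # 1.0 colours red, 0.0 colours blue
      return scores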

As always, interesting ideas on text visualization from Matthew Hurst.

Curious how much novelty (change?) you would see between SEC filings from the same law firm. Or, put another way, how much boilerplate is there in regulatory filings? I am mindful of the disaster plan for BP that included saving polar bears in the Gulf of Mexico.

Another interesting tool for exploring data and data sets in preparation to create topic maps.

September 19, 2011

Introducing CorporateGroupings

Filed under: Data,Dataset,Fuzzy Matching — Patrick Durusau @ 7:51 pm

Introducing CorporateGroupings: where fuzzy concepts meet legal entities

From the webpage:

One of the key issues when you’re looking at any big company is what are the constituent parts – because these days a company of any size is pretty much never a single legal entity, but a web of companies, often spanning multiple jurisdictions.

Sometimes this is done because the company’s operations are in different territories, sometimes because the company is a conglomerate of different companies – an educational book publisher and a financial newspaper, for example. Sometimes it’s done to limit the company’s tax liability, or for other legal reasons (e.g. to benefit from a jurisdiction’s rules & regulations compared with the ‘parent’ company’s jurisdiction).

Whatever the reason, getting a handle on the constituent parts is pretty tricky, whether you’re a journalist, a campaigner, a government tax official or a competitor, and making it public is trickier still, meaning the same research is duplicated again and again. And while we may all want to ultimately surface in detail the complex cross-holdings of shareholdings between the different companies, that goal is some way off, not least because it’s not always possible to discover the shareholders of a company.

….

So you must make do with reading annual reports and trawling company registries around the world, and hoping you don’t miss any. We like to think OpenCorporates has already made this quite a bit easier, meaning that a single search for Tesco returns hundreds of results from around the world, not just those in the UK, or some other individual jurisdiction. But what about where the companies don’t include the group in the name, and how do you surface the information you’ve found for the rest of the world?

The solution to both, we think, is Corporate Groupings, a way of describing a grouping of companies without having to say exactly what legal form that relationship takes (it may be a subsidiary of a subsidiary, for example). In short, it’s what most humans (i.e. non tax-lawyers) think of when they think of a large company – whether it’s a HSBC, Halliburton or HP.

This could have legs.

Not to mention that what is a separate subject to you (a subsidiary) may be encompassed by a larger subject to me. Both are valid from a certain point of view.
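In code, a corporate grouping is little more than a named container that stays deliberately silent about legal structure. A sketch (all names and the company number are placeholders of mine, not OpenCorporates’ implementation):

  from dataclasses import dataclass, field

  @dataclass
  class LegalEntity:
      name: str
      jurisdiction: str
      company_number: str

  @dataclass
  class CorporateGrouping:
      name: str                 # the everyday "fuzzy" company name
      members: list = field(default_factory=list)
      def add(self, entity):
          # membership only; the legal form of the relation is unstated
          self.members.append(entity)

  tesco = CorporateGrouping("Tesco")
  tesco.add(LegalEntity("Tesco PLC", "gb", "00000000"))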

September 16, 2011

Strata 2011 Live Video Stream

Filed under: BigData,Conferences,Data — Patrick Durusau @ 6:43 pm

Strata 2011 Live Video Stream

From the webpage:

In case you don’t have the luck to be in New York around this time, but want to get a glimpse of what’s happening at the Strata Conference, listen up: O’Reilly kindly provides live broadcasts from keynotes, talks and workshops. You can see the full schedule of broadcasts here: http://datavis.ch/oBT4EO.

Strata doesn’t ring a bell? It’s one of the biggest conferences focused on data and the business around it, organized by O’Reilly.

Strata Conference covers the latest and best tools and technologies for this new discipline, along the entire data supply chain—from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively. With hardcore technical sessions on parallel computing, machine learning, and interactive visualizations; case studies from finance, media, healthcare, and technology; and provocative reports from experts and innovators, Strata Conference showcases the people, tools, and technologies that make data work.

This is the other reason I buy O’Reilly publications.

Building Data Science Teams

Filed under: Data,Data Analysis — Patrick Durusau @ 6:38 pm

Building Data Science Teams: The Skills, Tools, and Perspectives Behind Great Data Science Groups by DJ Patil.

From page 1:

Given how important data science has grown, it’s important to think about what data scientists add to an organization, how they fit in, and how to hire and build effective data science teams.

Nothing you probably haven’t heard before but a reminder isn’t a bad thing.

The tools to manipulate data are becoming commonplace. What remains and will remain elusive, will be the skills to use those tools well.

September 14, 2011

FigShare

Filed under: Data,Dataset — Patrick Durusau @ 6:59 pm

FigShare

From the website:

Scientific publishing as it stands is an inefficient way to do science on a global scale. A lot of time and money is being wasted by groups around the world duplicating research that has already been carried out. FigShare allows you to share all of your data, negative results and unpublished figures. In doing this, other researchers will not duplicate the work, but instead may publish with your previously wasted figures, or offer collaboration opportunities and feedback on preprint figures.

There wasn’t a category on the site for CS data sets, or rather for the results of processing/searching data sets.

Would that be the same thing?

Thinking it would be interesting to have examples of data analysis that failed along with the data sets in question. Or at least pointers to the data sets.

September 10, 2011

GTD – Global Terrorism Database

Filed under: Authoring Topic Maps,Data,Data Integration,Data Mining,Dataset — Patrick Durusau @ 6:08 pm

GTD – Global Terrorism Database

From the homepage:

The Global Terrorism Database (GTD) is an open-source database including information on terrorist events around the world from 1970 through 2010 (with annual updates planned for the future). Unlike many other event databases, the GTD includes systematic data on domestic as well as international terrorist incidents that have occurred during this time period and now includes more than 98,000 cases.

While chasing down a paper that didn’t make the cut I ran across this data source.

Lacking an agreed-upon definition of terrorism (see Chomsky, for example), you may or may not find what you consider to be incidents of terrorism in this dataset.

Nevertheless, it is a dataset of events of popular interest and can be used to attract funding for your data integration project using topic maps.

TV Tropes

Filed under: Authoring Topic Maps,Data,Interface Research/Design — Patrick Durusau @ 6:06 pm

TV Tropes

Sam Hunting forwarded this to my attention.

From the homepage:

What is this about? This wiki is a catalog of the tricks of the trade for writing fiction.

Tropes are devices and conventions that a writer can reasonably rely on as being present in the audience members’ minds and expectations. On the whole, tropes are not clichés. The word clichéd means “stereotyped and trite.” In other words, dull and uninteresting. We are not looking for dull and uninteresting entries. We are here to recognize tropes and play with them, not to make fun of them.

The wiki is called “TV Tropes” because TV is where we started. Over the course of a few years, our scope has crept out to include other media. Tropes transcend television. They reflect life. Since a lot of art, especially the popular arts, does its best to reflect life, tropes are likely to show up everywhere.

We are not a stuffy encyclopedic wiki. We’re a buttload more informal. We encourage breezy language and original thought. There Is No Such Thing As Notability, and no citations are needed. If your entry cannot gather any evidence by the Wiki Magic, it will just wither and die. Until then, though, it will be available through the Main Tropes Index.

I rather like the definition of trope as “devices and conventions that a writer can reasonably rely on as being present in the audience members’ minds and expectations.” I would guess under some circumstances we could call those “subjects,” which we can include in a topic map. And then map the occurrences of those subjects in TV shows, for example.

As the site points out, it is called TV Tropes because it started with TV, but tropes have a much larger range than TV.

Being aware of and able to invoke (favorable) tropes in the minds of your users is one part of selling your topic map solution.

September 9, 2011

Kasabi

Filed under: Data,Data as Service (DaaS),Data Source,RDF — Patrick Durusau @ 7:16 pm

Kasabi

A data-as-service site that offers access to data (no downloads) via API keys. It has help for authors preparing their data, APIs for the data, etc. Currently in beta.

I mention it because data as service is one model for delivery of topic map content so the successes, problems and usage of Kasabi may be important milestones to watch.

True, Lexis/Nexis, WestLaw, and any number of other commercial vendors have sold access to data in the past, but it was mostly dumb data. That is, you had to contribute something to it to make it meaningful. We are in the early stages, but I think a market for data that works with my data is developing.

The options to download citations in formats that fit particular bibliographic programs are an impoverished example of delivered data working with local data.

Not quite the vision for the Semantic Web but it isn’t hard to imagine your calendaring program showing links to current news stories about your appointments. You have to supply the reasoning to cancel the appointment with the bank president just arrested for securities fraud and to increase your airline reservations to two (2).

August 18, 2011

Building data startups: Fast, big, and focused

Filed under: Analytics,BigData,Data,Data Analysis,Data Integration — Patrick Durusau @ 6:54 pm

Building data startups: Fast, big, and focused (O’Reilly original)

Republished by Forbes as:
Data powers a new breed of startup

Based on the talk Building data startups: Fast, Big, and Focused

by Michael E. Driscoll

From the post:

A new breed of startup is emerging, built to take advantage of the rising tides of data across a variety of verticals and the maturing ecosystem of tools for its large-scale analysis.

These are data startups, and they are the sumo wrestlers on the startup stage. The weight of data is a source of their competitive advantage. But like their sumo mentors, size alone is not enough. The most successful of data startups must be fast (with data), big (with analytics), and focused (with services).

Describes the emerging big data stack and says:

The competitive axes and representative technologies on the Big Data stack are illustrated here. At the bottom tier of data, free tools are shown in red (MySQL, Postgres, Hadoop), and we see how their commercial adaptations (InfoBright, Greenplum, MapR) compete principally along the axis of speed, offering faster processing and query times. Several of these players are pushing up towards the second tier of the data stack, analytics. At this layer, the primary competitive axis is scale: few offerings can address terabyte-scale data sets, and those that do are typically proprietary. Finally, at the top layer of the big data stack lie the services that touch consumers and businesses. Here, focus within a specific sector, combined with depth that reaches downward into the analytics tier, is the defining competitive advantage.

The future isn’t going to be about getting users to develop topic maps, but about your use of topic maps (and other tools) to create data products of interest to users.

Think of it as the difference between selling oil change equipment versus being the local Jiffy Lube. (For non-U.S. residents: Jiffy Lube is a chain of oil change and other services, with some 2,000 locations in North America.) I dare say that Jiffy Lube and its competitors do more auto services than users of oil change equipment.

July 24, 2011

Data Triage with SAS

Filed under: Data — Patrick Durusau @ 6:47 pm

Data Triage with SAS

Deeply amusing and useful post on basics of looking at data to spot obvious issues.

It really doesn’t matter how clever your analysis may be if your data is incorrect or, more likely, your assumptions about the data are incorrect.
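The same triage is a few lines in any language. A hedged Python sketch (the post itself uses SAS; the file name and missing-value codes are placeholders): eyeball missing values, distinct counts, and suspicious repeats before any clever analysis.

  import csv
  from collections import Counter

  with open("data.csv", newline="") as f:
      rows = list(csv.DictReader(f))

  for col in rows[0]:
      values = [r[col] for r in rows]
      missing = sum(1 for v in values if v in ("", "NA", "."))
      print("%s: %d missing, %d distinct, most common: %s"
            % (col, missing, len(set(values)),
               Counter(values).most_common(1)))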

Take heed.

July 15, 2011

Data, Graphs, and Combinatorics…Oh My!

Filed under: Combinatorics,Conferences,Data,Graphs — Patrick Durusau @ 6:48 pm

DATA, GRAPHS, and COMBINATORICS in BIO-INFORMATICS, FINANCE, LINGUISTICS, and NATIONAL SECURITY

From the workshops page:

26-27 July 2011
College of Staten Island
City University of New York

This two-day workshop is designed to address current topics in Data Intensive Computing. Topics covered will include computational statistical, graph theoretic and combinatoric approaches in bio-informatics, financial data analytics, linguistics, and national security.

Technology is allowing researchers, government agencies, and companies to acquire huge amounts of information in the sciences and on human behavior. The challenge confronting researchers is how to discern meaningful information and relationships from this plethora of data. The workshop will focus on new algorithmic techniques and their computational implementation, including:

  • How Watson won Jeopardy!
  • Computational graph theory and combinatorics in bio-informatics, on Wall Street, and in National Security applications.
  • The role of high performance computing, graph theory and combinatorics in data intensive computing.

Workshop speakers include noted representatives from academe, government research laboratories, and industry. A list of speakers is attached.

The workshop will be held on Tuesday and Wednesday, July 26-27, 2011 from 8:15 AM to 4:45 PM in the Recital Hall of the Center for the Arts on the campus of the College of Staten Island, 2800 Victory Boulevard, Staten Island, New York 10314.

A continental breakfast, lunch, and refreshments will be provided each day. There is an attendance fee of $85 per person ($50 for students). Advanced registration is required.

Workshop information:
http://www.csi.cuny.edu/cunyhpc/workshops.php

Registration page:
http://www.csi.cuny.edu/cunyhpc/registration.php

Directions page:
http://www.csi.cuny.edu/cunyhpc/directions.html

For information, send email to:
hpcworkshops@csi.cuny.edu

If you will be on Staten Island, July 26-27, 2011, register and attend!

If you can get to Staten Island, July 26-27, 2011, register and attend!

This looks quite exciting!
