Another Word For It
Patrick Durusau on Topic Maps and Semantic Diversity

April 24, 2012

Data Virtualization

Filed under: BigData,Data,Data Analysis,Data Virtualization — Patrick Durusau @ 7:17 pm

David Loshin has a series of excellent posts on data virtualization:

Fundamental Challenges in Data Reusability and Repurposing (Part 1 of 3)

Simplistic Approaches to Data Federation Solve (Only) Part of the Puzzle – We Need Data Virtualization (Part 2 of 3)

Key Characteristics of a Data Virtualization Solution (Part 3 of 3)

In part 3, David concludes:

In other words, to truly provision high quality and consistent data with minimized latency from a heterogeneous set of sources, a data virtualization framework must provide at least these capabilities:

  • Access methods for a broad set of data sources, both persistent and streaming
  • Early involvement of the business user to create virtual views without help from IT
  • Software caching to enable rapid access in real time
  • Consistent views into the underlying sources
  • Query optimizations to retain high performance
  • Visibility into the enterprise metadata and data architectures
  • Views into shared reference data
  • Accessibility of shared business rules associated with data quality
  • Integrated data profiling for data validation
  • Integrated application of advanced data transformation rules that ensure consistency and accuracy

What differentiates a comprehensive data virtualization framework from simplistic layering of access and caching services via data federation is that the comprehensive data virtualization solution goes beyond just data federation. It is not only about heterogeneity and latency, but must incorporate the methodologies that are standardized within the business processes to ensure semantic consistency for the business. If you truly want to exploit the data virtualization layer for performance and quality, you need to have aspects of the meaning and differentiation between use of the data engineered directly into the implementation. And most importantly, also make sure the business user signs-off on the data that is being virtualized for consumption. (emphasis added)

David makes explicit a number of issues, such as integration architectures needing to peer into enterprise metadata and data structures, making it plain that not only the data itself but also the ways we contain and store data have semantics.

I would add: Consistency and accuracy should be checked on a regular basis with specified parameters for acceptable correctness.
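To make that concrete, here is a minimal sketch of the kind of scheduled check I have in mind: a batch of records, a handful of rules, and an acceptable error rate as the “specified parameter.” The rules, field names and threshold are my own illustrations, not anything from David’s posts.

```python
# Minimal sketch of a recurring consistency/accuracy check.
# Field names, rules and the 2% threshold are illustrative only.
def check_quality(records, rules, acceptable_error_rate=0.02):
    """Return (passed, per-rule error rates) for a batch of records."""
    results = {}
    for name, predicate in rules.items():
        failures = sum(1 for r in records if not predicate(r))
        results[name] = failures / max(len(records), 1)
    passed = all(rate <= acceptable_error_rate for rate in results.values())
    return passed, results

rules = {
    "customer_id present": lambda r: bool(r.get("customer_id")),
    "amount non-negative": lambda r: r.get("amount", 0) >= 0,
}
batch = [{"customer_id": "c1", "amount": 10.0},
         {"customer_id": "", "amount": -5}]
print(check_quality(batch, rules))
```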

The heterogeneous data sources that David speaks of are ever changing, both in form and semantics. If you need proof of that, consider the history of ETL at your company. If either form or semantics were stable, ETL would be a once- or twice-in-a-career event. I think we all know that is not the case.

Topic maps can disclose the data and rules behind the virtualization decisions that David enumerates, which has the potential to make those decisions themselves auditable and reusable.

Reuse is an advantage in a constantly changing and heterogeneous semantic environment. Semantics seen once are very likely to be seen again. (Patterns, anyone?)

Puzzle: A path through pairs making squares

Filed under: Graphs,R — Patrick Durusau @ 7:16 pm

Puzzle: A path through pairs making squares

A graph based solution to the following problem:

Take the numbers 1, 2, 3, etc. up to 17.

Can you write out all seventeen numbers in a line so that every pair of numbers that are next to each other, adds up to give a square number?

Solved and then displayed with graphs and R.
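If you want to try the problem yourself without R, here is a minimal plain-Python sketch of the same idea: build the graph whose edges join pairs summing to a perfect square, then search it for a Hamiltonian path. (The original post does the graph construction and display in R; this is just the bare approach.)

```python
# Vertices are 1..17; edges join pairs whose sum is a perfect square.
N = 17
squares = {k * k for k in range(2, 7)}  # 4, 9, 16, 25 cover all sums up to 33
neighbors = {i: [j for j in range(1, N + 1) if j != i and i + j in squares]
             for i in range(1, N + 1)}

def extend(path, remaining):
    """Backtracking search for a Hamiltonian path."""
    if not remaining:
        return path
    for nxt in neighbors[path[-1]]:
        if nxt in remaining:
            found = extend(path + [nxt], remaining - {nxt})
            if found:
                return found
    return None

for start in range(1, N + 1):
    solution = extend([start], set(range(1, N + 1)) - {start})
    if solution:
        print(solution)
        break
```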

Graphs are on the upswing as a research area. Worth your while to expand your reading list in that direction.

Zotero – A Manual for Electronic Legal Referencing

Filed under: Law - Sources,Zotero — Patrick Durusau @ 7:16 pm

Zotero – A Manual for Electronic Legal Referencing by John Prebble and Julia Caldwell.

From the abstract:

This manual explains how to operate Zotero.

Zotero is a free, open-source referencing tool that operates by “enter once, use many”. It captures references by one-click acquisition from databases of legal materials that cooperate with it. Users enter other references manually, with similar effort to typing a footnote.

Zotero’s chief strength is multi-style flexibility. Authors build libraries of references that are pasted into scholarly work with one click; authors can choose between legal referencing styles, with Zotero automatically formatting references according to the chosen style. Ability to format seamlessly across a potentially unlimited number of styles distinguishes Zotero from competing referencing tools. Zotero afficionados regularly add more styles.

The present manual is thought to be the only full manual for non-technical users of Zotero. It employs the New Zealand referencing style for examples, but its principles are the same for all styles.

Probably better to say:

“This manual explains how to use Zotero for legal citations.” (And go ahead and put in the link to Zotero, which is a really nifty bit of software.)

Uses New Zealand law for examples.

Do you know if anyone has done U.S. law examples for Zotero?

BTW, Zotero does duplicate merging:

Zotero currently uses the title, DOI, and ISBN fields to determine duplicates. The algorithm will be improved in the future to incorporate other fields.
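Just to picture what matching on those three fields could look like, here is a rough sketch (my own, not Zotero’s actual algorithm, and the sample records are invented):

```python
# Rough duplicate detection keyed on normalized DOI, ISBN or title.
import re

def keys(item):
    """Yield normalized candidate keys for a reference item."""
    if item.get("doi"):
        yield ("doi", item["doi"].strip().lower())
    if item.get("isbn"):
        yield ("isbn", re.sub(r"[^0-9Xx]", "", item["isbn"]).upper())
    if item.get("title"):
        yield ("title", re.sub(r"\W+", " ", item["title"]).strip().lower())

def find_duplicates(items):
    seen, dupes = {}, []
    for item in items:
        for key in keys(item):
            if key in seen:
                dupes.append((seen[key], item))
                break
            seen[key] = item
    return dupes

library = [
    {"title": "A Manual for Electronic Legal Referencing", "isbn": "978-0-00-000000-0"},
    {"title": "A manual for electronic legal referencing!", "isbn": ""},
]
print(find_duplicates(library))  # the two title variants are flagged as one pair
```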

Zotero could be a lightweight way to get users to gather content for later import and improvement in a topic map. Worth checking out.

Mandelbaum on How XML Can Improve Transparency and Workflows for Legislatures

Filed under: Law,Legal Informatics — Patrick Durusau @ 7:16 pm

Mandelbaum on How XML Can Improve Transparency and Workflows for Legislatures

From Legal Informatics Blog a post reporting on the use of XML in legislatures.

You need to read Mandelbaum’s post (lots of good pointers), where Mandelbaum concedes that open formats != transparency but offers the following advantages to bring legislatures around to XML:

  • Preservation.
  • Efficiency.
  • Cost-Effectiveness.
  • Flexibility.
  • Ease of Use.

Personally, I would get a group of former legislators to invest in XML-based solutions and have them lobby their former colleagues for the new technology. That would take less time than waiting for current vendors to get up to speed on XML.

The various benefits of XML, while real, would be how the change to XML is explained to members of the public.

Topic maps could be used by others to track such relationships and changes. That might result in free advertising for the former members of the legislature. A sort of external validation of their effectiveness.

TEI Boilerplate

Filed under: Text Encoding Initiative (TEI),XML — Patrick Durusau @ 7:15 pm

TEI Boilerplate

If you don’t know it, the TEI (Text Encoding Initiative) is one of the oldest digital humanities projects, dedicated to fashioning encoding solutions for non-digital texts. The Encoding Guidelines, as they are known, were designed to capture the complexities of pre-digital texts.

If you doubt the complexities of pre-digital texts, consider the following image of a cover page from the Leningrad Codex:

Leningrad Codex Image

Or, consider this page from the Mikraot Gedolot:

Mikraot Gedolot Image

There are more complex pages, such as the mss. of Charles Peirce (Peirce Logic Notebook, Charles Sanders Peirce Papers MS Am 1632 (339). Houghton Library, Harvard University, Cambridge, Mass.):

Peirce Logic Notebook, Charles Sanders Peirce Papers MS AM 1632 (339)

And those are just a few random examples. Encoding pre-digital texts is a complex and rewarding field of study.

Not that “born digital” texts need concede anything to “pre-digital” texts. When you think about our capacity to capture versions, multiple authors, sources, interpretations of readers, discussions and the like, the wealth of material that can be associated with any one text becomes quite complex.

Consider, for example, the Harry Potter book series, which spawned websites, discussion lists, interviews with the author, films and other resources. Not quite like the interpretative history of the Bible, but enough to make an interesting problem.

Anything that can encode that range of texts is of necessity quite complex itself, and therein lies the rub. You work very hard at document analysis, using or extending the TEI Guidelines to encode your text. Now what?

You can:

  1. Show the XML text to family and friends. Always a big hit at parties. 😉
  2. Use your tame XSLT wizard to create a custom conversion of the XML text so normal people will want to see and use it.
  3. Use the TEI Boilerplate project for a stock delivery of the XML text so normal people (like your encoders and funders) will want to see and use it.

From the webpage:

TEI Boilerplate is a lightweight solution for publishing styled TEI (Text Encoding Initiative) P5 content directly in modern browsers. With TEI Boilerplate, TEI XML files can be served directly to the web without server-side processing or translation to HTML. Our TEI Boilerplate Demo illustrates many TEI features rendered by TEI Boilerplate.

Browser Compatibility

TEI Boilerplate requires a robust, modern browser to do its work. It is compatible with current versions of Firefox, Chrome, Safari, and Internet Explorer (IE 9). If you have problems with TEI Boilerplate with a modern browser, please let us know by filing a bug report at https://sourceforge.net/p/teiboilerplate/tickets/.

Many thanks to John Walsh, Grant Simpson, and Saeed Moaddeli, all from Indiana University for this wonderful addition to the TEI toolbox!

PS: If you have disposable funds and aren’t planning on mining asteroids, please consider donating to the TEI (Text Encoding Initiative). Even asteroid miners need to know Earth history, a history written in texts.

Machine learning for identification of cars

Filed under: Machine Learning,R — Patrick Durusau @ 7:14 pm

Machine learning for identification of cars by Dzidorius Martinaitis.

A very awesome post (with code) on capturing video from traffic cameras and training your computer to recognize cars.

The post covers using video from public sources, but the thought does occur to me that you could spool the output from a digital camera attached to a telephoto lens to a personal computer for encryption and transfer over the Net, so that you could get higher quality images than from a public feed.
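For a sense of the moving parts, here is a bare-bones capture-and-detect loop in Python with OpenCV. The stream URL and the pre-trained “cars.xml” cascade are placeholders of mine; the post itself trains its own detector from the traffic camera footage.

```python
import cv2

cap = cv2.VideoCapture("http://example.com/traffic-cam.mjpg")  # hypothetical feed
car_cascade = cv2.CascadeClassifier("cars.xml")                # hypothetical cascade

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Detect candidate cars and draw a box around each one.
    cars = car_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3)
    for (x, y, w, h) in cars:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("traffic", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```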

I am sure you will enjoy experimenting with it both as illustrated in the post and as other possibilities suggest themselves to you.

Marketing in the Digital Age: Winning with Data & Analytics

Filed under: BigData — Patrick Durusau @ 7:14 pm

Marketing in the Digital Age: Winning with Data & Analytics by DataXu.

This paper was quoted by the blog post I cited in Let’s Party Like It’s 1994!

It is a helpful report but some cautions:

  • At page 6, the estimated $300 billion in health care savings. This is like the RIAA’s estimates of billion-dollar piracy in countries with fewer than 1 million inhabitants. It’s not shared reality beyond the RIAA marketing department.
  • At page 9, the chart would have you believe current difficulties are about 1/3 “Lack of software/technology to perform analytics on digital marketing data.” Sigh. That is, and always is, wrong.

    Repeat five (5) times: Technology cannot fix personnel issues. (Like lack of talent, insight, inter-departmental moats, etc.)

    Technology is almost always seized upon as an answer because it avoids the tough decisions. Like firing department heads who won’t cooperate with other departments. Or who say: “Our requirements are unique because no one else does it this way.” (I have an answer for that observation if you would like to hear it sometime.) Or who say: “Our staff can’t do X (whatever X may be).” (I have an answer for that observation as well.)

  • At page 18, the use of outsiders for analysis of digital marketing suffers from a lack of ROI information. That should be a clear warning sign. If you can’t get ROI data from an outsider, how are you going to get it with digital marketing in-house?
  • At page 22, “5 Ways You Can Profit from Big Data”

    1. By creating transparency. Organizations that make Big Data available to more stakeholders can accelerate and improve decision-making and cycle times, and reduce search costs;

    Ask anyone who promises you better decision-making to show you examples of the same. On a consistent basis due to big data, controlling all other variables.

  • At page 29, absence of a cross-channel digital marketing platform “…as the most significant impediment to more money flowing to digital marketing.”

    That’s a misconception of the problem. No sane business wants more money flowing to digital marketing. Sane businesses want more ROI and if increasing money flowing to digital marketing does the trick, so be it. But digital marketing for its own sake isn’t a goal, at least not a useful one. With or without a cross-channel digital marketing platform.

If your use of “big data” doesn’t result in understandings that cut across departments, teams and even customers, then you may be losing market share and not even know it. At least not immediately.

Let’s Party Like It’s 1994!

Filed under: BigData,Government,Government Data — Patrick Durusau @ 7:13 pm

Just coincidence that I read Fast algorithms for mining association rules (in Mining Basket Data) the same day I read: Big Data Lessons for Big Government by Julie Ginches.

The points Julie pulls out from a study by DataXu could have easily been from 1994.

The dates and names have changed, the challenges have not.

  • Employees need new skills, new technologies, and new ways to combine information from multiple sources so they can make sense of all the data pouring in so they can add more value and be effective. This new way of working directly applies to and will benefit both private industry and government.
  • Organizations need departmental specialists to work with IT to create systems that are better at collecting, managing, and analyzing data. If the government is going to succeed with big data, it will need to find better ways to communicate and collaborate across organizations, with tools that can be used by technical and non-technical staff in order to make discoveries and quickly act.
  • Enterprise businesses need a single, cross-channel platform to manage their data flows. The same is likely to hold true for government agencies that have typically been hamstrung in their data analysis because information is spread across multiple different, disconnected silos and multiple public and private organizations.
  • Seventy-five percent indicate that data has the potential to dramatically improve their business; however, 58 percent report that their organizations don’t have the quantitative skills and technology needed to analyze the data. More than 70 percent report they can’t effectively leverage the full value of their customer data….
  • 90% indicate that digital marketing can reduce customer acquisition costs through increased efficiency, but 46% report that they lack the information they need to communicate the benefits of big data to management….

If we are recycling old problems, that means solutions to those problems failed.

If we use the same solutions for the same problems this time, what result would you expect? (Careful, you only get one answer.)

Look for Let’s Party Like It’s 1994 II to read about the one commonality of Julie’s five points, the one that topic maps can address effectively.

Software Review- BigML.com – Machine Learning meets the Cloud

Filed under: Cloud Computing,Machine Learning — Patrick Durusau @ 7:13 pm

Software Review- BigML.com – Machine Learning meets the Cloud.

Ajay Ohri reviews BigML.com, an attempt to lower the learning curve for working with machine learning and large data sets.

Ajay concludes:

Overall a welcome addition to make software in the real of cloud computing and statistical computation/business analytics both easy to use and easy to deploy with fail safe mechanisms built in.

Check out https://bigml.com/ for yourself to see.

I have to agree they are off to a good start.

Applications that lower the learning curve look like “hot” properties for the near future. Some loss of flexibility, but offset by immediate and possibly useful results. Maybe the push some users need to become real experts.

Miso: An open source toolkit for data visualisation

Filed under: Graphics,Visualization — Patrick Durusau @ 7:13 pm

Miso: An open source toolkit for data visualisation

Nathan Yau writes:

Your online visualization options are limited when you don’t know how to program. The Miso Project, a collaboration between The Guardian and Bocoup, is an effort to lighten the barrier to entry.

While the goal is to build a toolkit that makes visualization easier and faster, the first release of the project is Dataset, a JavaScript library to setup the foundation of any good data graphic. If you’ve ever worked with data on the Web, you know there are a variety of (usually painful) steps you have to go through before you actually get to fun stuff. Dataset will help you with the data transformation and management grunt work.

Nathan says he will be keeping an eye on this project. I suggest that we all do the same.

April 23, 2012

Happy in Bhutan – No Sex, Drugs or Rock-n-Roll?

Filed under: Graphics,Humor,Visualization — Patrick Durusau @ 6:00 pm

Junk Charts has a great piece called The Earth Institute needs a graphics advisor on the World Happiness Report.

Junk Charts has valid criticisms of the happiness pie chart for the people of Bhutan, but when you see the choices:

  • Health
  • Ecological diversity and resilience
  • Psychological well-being
  • Community vitality
  • Living standards
  • Time use
  • Cultural diversity and resilience
  • Good Governance
  • Education

No way to include sex, drugs or rock-n-roll?

Sometimes you don’t need a lot of technical training to know a survey is completely bogus.

Be brave enough to point out an expensive marketing survey was a complete and utter failure. Easier if it was someone else’s idea/project but will earn you high marks in any case. Just don’t repeat the failure.

Mining Basket Data

Filed under: BigData,Data Mining — Patrick Durusau @ 6:00 pm

Database mining is motivated by the decision support problem faced by most large retail organizations [S+93]. Progress in bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data, referred to as the basket data. A record in such data typically consists of transaction date and the items bought in the transaction. Successful organizations view such databases as important pieces of marketing infrastructure [Ass92]. They are interested in instituting information-driven marketing processes, managed by database technology, that enable marketers to develop and implement customized marketing programs and strategies [Ass90]. (emphasis added)

Sounds like a marketing pitch for big data doesn’t it?

In 1994, Rakesh Agrawal and Ramakrishnan Srikant had basket data and wrote: Fast algorithms for mining association rules (1994). Now listed as the 18th most cited computer science article by Citeseer.
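The core of that paper, mining frequent itemsets above a minimum support threshold, fits in a few lines today. A toy sketch (the baskets and the 60% threshold are mine, not the paper’s data):

```python
def apriori(transactions, min_support):
    """Minimal Apriori-style frequent itemset mining over basket data."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    all_frequent, k = {}, 1
    while frequent:
        all_frequent.update({fs: support(fs) for fs in frequent})
        # Candidate (k+1)-itemsets from unions of frequent k-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        frequent = {c for c in candidates if support(c) >= min_support}
        k += 1
    return all_frequent

baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
for itemset, s in sorted(apriori(baskets, 0.6).items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(s, 2))
```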

Mining “data” isn’t new nor is mining for association rules. Not to mention your prior experience with association rules.

With a topic map you can capture that prior experience alongside new association rules and methods. Marketing wasn’t started yesterday and isn’t going to stop tomorrow. Successful firms are going to build on their experience, not re-invent it with each technology change.

Rough Set Rudiments

Filed under: Rough Sets — Patrick Durusau @ 5:59 pm

Rough Set Rudiments by Zdzislaw Pawlak and Andrzej Skowron.

From the basic philosophy section:

The rough set philosophy is founded on the assumption that with every object of the universe of discourse we associate some information (data, knowledge). For example, if objects are patients suffering from a certain disease, symptoms of the disease form information about patients. Objects characterized by the same information are indiscernible (similar) in view of the available information about them. The indiscernibility relation generated in this way is the mathematical basis for rough set theory.

Any set of all indiscernible (similar) objects is called an elementary set, and forms a basic granule (atom) of knowledge about the universe. Any union of some elementary sets is referred to as crisp (precise) set – otherwise the set is rough (imprecise, vague).

Consequently each rough set has boundary-line cases, i.e., objects which cannot be with certainty classified neither as members of the set nor of its complement. Obviously crisp sets have no boundary-line elements at all. That means that boundary-line cases cannot be properly classified by employing the available knowledge.

Thus, the assumption that objects can be “seen” only through the information available about them leads to the view that knowledge has granular structure. Due to the granularity of knowledge some objects of interest cannot be discerned and appear as the same (or similar). As a consequence vague concepts, in contrast to precise concepts, cannot be characterized in terms of information about their elements. Therefore, in the proposed approach, we assume that any vague concept is replaced by a pair of precise concepts – called the lower and the upper approximation of the vague concept. The lower approximation consists of all the objects which surely belong to the concept and the upper approximation contains all objects which possibly belong to the concept. Obviously, the difference between the upper and lower approximation constitutes the boundary region of the vague concept. Approximations are two basic operations in rough set theory.
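To make the lower/upper approximation idea concrete, here is a tiny sketch. The patient table is made up, but it follows the paper’s flu example: p1 and p2 are indiscernible on the recorded symptoms, yet only p1 is in the target set, so both end up in the boundary region.

```python
def partition(universe, attrs, info):
    """Group objects that are indiscernible on the chosen attributes."""
    blocks = {}
    for x in universe:
        key = tuple(info[x][a] for a in attrs)
        blocks.setdefault(key, set()).add(x)
    return list(blocks.values())

def lower_upper(universe, attrs, info, concept):
    blocks = partition(universe, attrs, info)
    lower = {x for b in blocks if b <= concept for x in b}   # surely belong
    upper = {x for b in blocks if b & concept for x in b}    # possibly belong
    return lower, upper

info = {
    "p1": {"headache": "yes", "temp": "high"},
    "p2": {"headache": "yes", "temp": "high"},
    "p3": {"headache": "no",  "temp": "high"},
    "p4": {"headache": "no",  "temp": "normal"},
}
universe = set(info)
flu = {"p1", "p3"}  # the (vague) concept to approximate
lower, upper = lower_upper(universe, ["headache", "temp"], info, flu)
print("lower:", lower)              # {'p3'}
print("upper:", upper)              # {'p1', 'p2', 'p3'}
print("boundary:", upper - lower)   # {'p1', 'p2'}
```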

I suspect that the normal case is “rough” sets, with “crisp” sets being an artifice of our views of the world.

This summary is a bit dated so I will use it as a basis for an update with citations to later materials.

“AI on the Web” 2012 – Saarbrücken, Germany

Filed under: Artificial Intelligence,Conferences,Heterogeneous Data — Patrick Durusau @ 5:59 pm

“AI on the Web” 2012 – Saarbrücken, Germany

Important Dates:

Deadline for Submission: July 5, 2012

Notification of Authors: August 14, 2012

Final Versions of Papers: August 28, 2012

Workshop: September 24/25, 2012

From the website:

The World Wide Web has become a unique source of knowledge on virtually any imaginable topic. It is continuously fed by companies, academia, and common people with a variety of information in numerous formats. By today, the Web has become an invaluable asset for research, learning, commerce, socializing, communication, and entertainment. Still, making full use of the knowledge contained on the Web is an ongoing challenge due to the special properties of the Web as an information source:

  • Heterogeneity: web data occurs in any kind of formats, languages, data structures and terminology one can imagine.
  • Decentrality: the Web is inherently decentralized which means that there is no central point of control that can ensure consistency or synchronicity.
  • Scale: the Web is huge and processing data at web scale is a major challenge in particular for knowledge‐intensive methods.

These characteristics make the Web a challenging but also a promising chance for AI methods that can help to make the knowledge on the Web more accessible for humans and machines by capturing, representing and using information semantics. The relevance and importance of AI methods for the Web is underlined by the fact that the AAAI – as one of the major AI conferences – has been featuring a special track “AI on the Web” for more than five years now. In line with this track and in order to stress this relevance within the German AI community, we are looking for work on relevant methods and their application to web data.

Look beyond the Web, to the larger world of information of the “deep” web or the even larger world of information, web or not, and what do you see?

Heterogeneity, Decentrality, Scale.

What we learn about AI for the Web may help us with larger information problems.

Wolfram Plays In Streets of Shakespeare’s London

Filed under: Literature,Mathematica — Patrick Durusau @ 5:58 pm

I should have been glad to read: To Compute or Not to Compute—Wolfram|Alpha Analyzes Shakespeare’s Plays. Promoting Shakespeare has to be a first for Wolfram.

But the post reports word counts, unique words, and similar measures as master strokes of engineering, all things familiar since SNOBOL and before. And then makes this “bold” suggestion:

Asking Wolfram|Alpha for information about specific characters is where things really begin to get interesting. We took the dialog from each play and organized them into dialog timelines that show when each character talks within a specific play. For example, if you look at the dialog timeline of Julius Caesar, you’ll notice that Brutus and Cassius have steady dialog throughout the whole play, but Caesar’s dialog stops about halfway through. I wonder why that is?

That sort of analysis was old hat in the 1980s.
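Here is roughly how little it takes: a toy “dialog timeline” that records where in a plain-text play each speaker’s lines fall. It assumes speaker names appear in capitals on their own line, which is a simplification of real play markup and certainly not Wolfram’s code.

```python
import re
from collections import defaultdict

def dialog_positions(text):
    """Map each speaker to the relative positions (0..1) of their lines."""
    positions = defaultdict(list)
    lines = text.splitlines()
    speaker = None
    for i, line in enumerate(lines):
        m = re.match(r"^([A-Z][A-Z ]+)\.", line.strip())
        if m:
            speaker = m.group(1).title()
        elif speaker and line.strip():
            positions[speaker].append(i / max(len(lines), 1))
    return positions

sample = """CAESAR.
The ides of March are come.
SOOTHSAYER.
Ay, Caesar, but not gone.
"""
for name, spots in dialog_positions(sample).items():
    print(name, [round(s, 2) for s in spots])
```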

Wolfram needs to catch up on the history of literary and linguistic computing rather than repeating it.

The back issues of Computational Linguistics or Literary and Linguistic Computing should help in that regard. To say nothing of Shakespeare, Computers, and the Mystery of Authorship and similar works.

On digital humanities projects in general, see: Digital Humanities Spotlight: 7 Important Digitization Projects by Maria Popova, for a small sample.

Graphical Data Mapping with Mule

Filed under: Data Integration,Mule — Patrick Durusau @ 5:58 pm

Graphical Data Mapping with Mule

May 3, 2012

From the announcement:

Do you struggle to transform data as part of your integration efforts? Has data transformation become a major pain? Your life is about to become a whole lot simpler!

See the new data mapping capabilities of Mule 3.3 in action! Fully integrated with Mule Studio at design time and Mule ESB at run time, Mule’s data mapping empowers developers to build data transformations through a graphical interface without writing custom code.

Join Mateo Almenta Reca, MuleSoft’s Director of Product Management, for a demo-focused preview of:

  • An overview of data mapping capabilities in Mule 3.3
  • Design considerations and deployment of applications that utilize data mapping
  • Several live demonstrations of building various data transformations

Marakana – Open Source Training

Filed under: Education,Training,Video — Patrick Durusau @ 5:57 pm

Marakana – Open Source Training

From the homepage:

Marakana’s raison d’être is to help people get better at what they do professionally. We accomplish this by organizing software training courses (both public and private) as well as publishing learning resources, sharing knowledge from industry leaders, providing a place to share useful tidbits and supporting the community. Our focus is open source software.

I found this while watching scikit-learn – Machine Learning in Python – Astronomy, which was broadcast on Marakana TechTV.

From the Marakana TechTV homepage:

Marakana TechTV is an initiative to provide the world with free educational content on cutting-edge open source topics. Check out our work.

We work with open source communities to cover tech events world wide, as well as industry experts to create high quality informational videos from Marakana’s studio in downtown San Francisco.

…and we do it all at no charge. As an open source training company, Marakana believes in helping people get better at what they do, and through Marakana TechTV we’re able to engage open source communities around the globe, promote our training services, and stay current on the latest and greatest in open source.

Useful content and possibly a place to post educational videos. Such as on topic maps?

scikit-learn – Machine Learning in Python – Astronomy

Filed under: Astroinformatics,Machine Learning,Python — Patrick Durusau @ 5:57 pm

scikit-learn – Machine Learning in Python – Astronomy by Jake VanderPlas. (tutorial)

Jake branched the scikit-learn site for his tutorial on scikit-learn using astronomical data.

A good introduction to scikit-learn that will be of interest to astronomy buffs.
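If you have not touched scikit-learn before, the whole library revolves around a small estimator API: construct, fit, then predict or score. A minimal sketch on synthetic data (using current scikit-learn module paths, and nothing drawn from Jake’s tutorial):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for real (e.g. astronomical) features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```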

Fahrenheit 118

Filed under: Graphics,Maps,Visualization — Patrick Durusau @ 5:56 pm

Sounds like a sci-fi knock-off, doesn’t it?

Junkcharts tells a different tale: The importance of explaining your chart: the case of the red 118

Great review of a temperature map for March 2012.

Two takeaways:

1) Even expert map makers (NOAA/NCDC) screw up and/or have a very difficult time communicating clearly.

2) Get someone outside your group to review maps/charts without any explanation from you. If they have to ask questions or you feel explanation is necessary, revise the map/chart and try again.

ICDM 2012

ICDM 2012 Brussels, Belgium | December 10 – 13, 2012

From the webpage:

The IEEE International Conference on Data Mining series (ICDM) has established itself as the world’s premier research conference in data mining. It provides an international forum for presentation of original research results, as well as exchange and dissemination of innovative, practical development experiences. The conference covers all aspects of data mining, including algorithms, software and systems, and applications.

ICDM draws researchers and application developers from a wide range of data mining related areas such as statistics, machine learning, pattern recognition, databases and data warehousing, data visualization, knowledge-based systems, and high performance computing. By promoting novel, high quality research findings, and innovative solutions to challenging data mining problems, the conference seeks to continuously advance the state-of-the-art in data mining. Besides the technical program, the conference features workshops, tutorials, panels and, since 2007, the ICDM data mining contest.

Important Dates:

ICDM contest proposals: April 30
Conference full paper submissions: June 18
Demo and tutorial proposals: August 10
Workshop paper submissions: August 10
PhD Forum paper submissions: August 10
Conference paper, tutorial, demo notifications: September 18
Workshop paper notifications: October 1
PhD Forum paper notifications: October 1
Camera-ready copies and copyright forms: October 15

April 22, 2012

AI & Statistics 2012

Filed under: Artificial Intelligence,Machine Learning,Statistical Learning,Statistics — Patrick Durusau @ 7:08 pm

AI & Statistics 2012 (La Palma, Canary Islands)

Proceedings:

http://jmlr.csail.mit.edu/proceedings/papers/v22/

As one big file:

http://jmlr.csail.mit.edu/proceedings/papers/v22/v22.tar.gz

Why you should care:

The fifteenth international conference on Artificial Intelligence and Statistics (AISTATS 2012) will be held on La Palma in the Canary Islands. AISTATS is an interdisciplinary gathering of researchers at the intersection of computer science, artificial intelligence, machine learning, statistics, and related areas. Since its inception in 1985, the primary goal of AISTATS has been to broaden research in these fields by promoting the exchange of ideas among them. We encourage the submission of all papers which are in keeping with this objective.

The conference runs April 21 – 23, 2012. Sorry!

You will enjoy looking over the papers!

Finding New Story Links Through Blog Clustering

Filed under: Blogs,Clustering,Searching — Patrick Durusau @ 7:08 pm

Finding New Story Links Through Blog Clustering

Matthew Hurst writes:

The basic mechanism used in track // microsoft to cluster articles is similar to that used by Techmeme. A fixed set of blogs are crawled and clustered based on specific features such as link structure and content (and in the case of Techmeme, additional human input). However, what about blogs that aren't known to the system?

I recently added a feature to track // microsoft which analyses clusters for popular urls and adds those to the bottom of the cluster. The title of the web page is used as a simple description of the popular page.

In the recent story about Nuno Silva's mistaken comment regarding the future of Windows Phone devices, there were many links to Nuno's own blog post. In addition to the large cluster of known blogs that were determined to be talking about the story, track // microsoft also surfaced Nuno's post through analysing the popular links discovered within the cluster.
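The “surface popular links inside a cluster” step is easy to picture in a few lines. A rough sketch of my own (the URLs are invented and this is not the track // microsoft implementation):

```python
from collections import Counter

def popular_links(cluster, known_blogs, top_n=3):
    """cluster: list of posts, each with an 'outlinks' list of URLs."""
    counts = Counter(
        url
        for post in cluster
        for url in post["outlinks"]
        if not any(url.startswith(b) for b in known_blogs)  # only unknown sources
    )
    return counts.most_common(top_n)

cluster = [
    {"outlinks": ["http://nunos.example/wp7", "http://example.com/a"]},
    {"outlinks": ["http://nunos.example/wp7"]},
    {"outlinks": ["http://nunos.example/wp7", "http://example.com/b"]},
]
print(popular_links(cluster, known_blogs=["http://known.example"]))
# [('http://nunos.example/wp7', 3), ('http://example.com/a', 1), ('http://example.com/b', 1)]
```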

Interesting blog discovery method.

Flexibility to Discover…

Filed under: Astroinformatics,Bayesian Data Analysis,Bayesian Models — Patrick Durusau @ 7:08 pm

David W. Hogg writes:

If you want to have the flexibility to discover correct structure in your data, you may have to adopt methods that permit variable model complexity.

Context to follow but think about that for a minute.

Do you want to discover structures or confirm what you already believe to be present?

In context:

On day zero of AISTATS, I gave a workshop on machine learning in astronomy, concentrating on the ideas of (a) trusting unreliable data and (b) the necessity of having a likelihood, or probability of the data given the model, making use of a good noise model. Before me, Zoubin Ghahramani gave a very valuable overview of Bayesian non-parametric methods. He emphasized something that was implied to me by Brendon Brewer’s success on my MCMC High Society challenge and mentioned by Rob Fergus when we last talked about image modeling, but which has rarely been explored in astronomy: If you want to have the flexibility to discover correct structure in your data, you may have to adopt methods that permit variable model complexity. The issues are two-fold: For one, a sampler or an optimizer can easily get stuck in a bad local spot if it doesn’t have the freedom to branch more model complexity somewhere else and then later delete the structure that is getting it stuck. For another, if you try to model an image that really does have five stars in it with a model containing only four stars, you are requiring that you will do a bad job! Bayesian non-parametrics is this kind of argument on speed, with all sorts of processes named after different kinds of restaurants. But just working with the simple dictionary of stars and galaxies, we could benefit from the sampling ideas at least. (emphasis added)

Isn’t that awesome? With all the astronomy data that is coming online? (With lots of it already online.)

Not to mention finding structures in other data as well. Maybe even in “big data.”
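For a hands-on taste of “variable model complexity,” scikit-learn’s BayesianGaussianMixture with a Dirichlet process prior will leave unneeded mixture components at near-zero weight instead of forcing a fixed count. A generic sketch of mine, not Hogg’s astronomy model:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Data actually drawn from 5 clusters ("five stars"), but we never say so.
centers = np.array([[0, 0], [5, 0], [0, 5], [5, 5], [2.5, 2.5]])
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in centers])

# Offer more components than needed; the DP prior is free to leave the
# extras with near-zero weight rather than being forced to use them all.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

print(np.round(dpgmm.weights_, 3))  # most of the mass should sit on ~5 components
```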

Big Data and the Coming Conceptual Model Revolution

Filed under: BigData,Conceptualizations — Patrick Durusau @ 7:07 pm

Big Data and the Coming Conceptual Model Revolution

Malcolm Chisholm writes:

Conceptual models must capture all business concepts and all relevant relationships. If instances of things are also part of the business reality, they must be captured too. Unfortunately, there is no standard methodology and notation to do this. Conceptual models that communicate business reality effectively require some degree of artistic imagination. They are products of analysis, not of design.(emphasis added)

That’s the trick isn’t it? Developing a good conceptual model.

You can have system requirements for multiple Terabytes of data storage, Gigabytes of bandwidth, messages and processes galore, but if you don’t have a good conceptual model, it’s just so much hardware junk.

Are you planning your system based on hardware or software capabilities?

Or are you developing a conceptual model you want to implement in hardware and software?

Which one do you think will come closer to meeting your needs?

DensoDB Is Out

Filed under: .Net,C# — Patrick Durusau @ 7:07 pm

DensoDB Is Out

From the website:

DensoDB is a new NoSQL document database. Written for .Net environment in c# language.

It’s simple, fast and reliable. More details on github https://github.com/teamdev/DensoDB

You can use it in three different ways:

  1. InProcess: No need of service installation and communication protocol. The fastest way to use it. You have direct access to the DataBase memory and you can manipulate objects and data in a very fast way.
  2. As a Service: Installed as a Windows Service, it can be used as a network document store. You can use a rest service or wcf service to access it. It’s not different from the previous way to use it, but you have a networking protocol and so it’s not as fast as the previous one.
  3. On a Mesh: mixing the previous two usage mode with a P2P mesh network, it can be easily syncronizable with other mesh nodes. It gives you the power of a distributed scalable fast database, in a server or server-less environment.

You can use it as a database for a stand alone application or in a mesh to share data in a social application. The P2P protocol for your application and synchronization rules will be transparent for you, and you’ll be able to develop all your application as it’s stand-alone and connected only to a local DB.

I don’t work in a .Net environment but am interested in experiences with .Net based P2P mesh networks and topic maps.

At some point I should set up a smallish Windows network with commodity boxes. Perhaps I could make all of them dual (or triple) boot so I could switch between distributed networks. If you have software or a box you would like to donate to the “cause” as it were, please be in touch.

The wrong way: Worst best practices in ‘big data’ analytics programs

Filed under: Analytics,BigData,Government,Government Data — Patrick Durusau @ 7:07 pm

The wrong way: Worst best practices in ‘big data’ analytics programs

Rick Sherman writes:

“Big data” analytics is hot. Read any IT publication or website and you’ll see business intelligence (BI) vendors and their systems integration partners pitching products and services to help organizations implement and manage big data analytics systems. The ads and the big data analytics press releases and case studies that vendors are rushing out might make you think it’s easy — that all you need for a successful deployment is a particular technology.

If only it were that simple. While BI vendors are happy to tell you about their customers who are successfully leveraging big data for analytics uses, they’re not so quick to discuss those who have failed. There are many potential reasons why big data analytics projects fall short of their goals and expectations. You can find lots of advice on big data analytics best practices; below are some worst practices for big data analytics programs so you know what to avoid.

Rick gives seven reasons why “big data” analytics projects fail:

  1. “If we build it, they will come.”
  2. Assuming that the software will have all the answers.
  3. Not understanding that you need to think differently.
  4. Forgetting all the lessons of the past.
  5. Not having the requisite business and analytical expertise.
  6. Treating the project like it’s a science experiment.
  7. Promising and trying to do too much.

Seven reasons that should be raised when the NSA Money Trap project fails.

Because no one has taken responsibility for those seven issues.

Or asked the contractors: What about your failed “big data” analytics projects?

Simple enough question.

Do you ask that question?

Texas Library Association: The Next Generation of Knowledge Management

Filed under: Law,Legal Informatics — Patrick Durusau @ 7:06 pm

Texas Library Association: The Next Generation of Knowledge Management

Greg Lambert writes:

I had the honor of presenting to at the Texas Library Association Conference here in Houston today. The topic was on Library and Knowledge Management’s collaborative roles within a firm, and how they can work together to bring in better processes, automate certain manual procedures, and add analyze data in a way that makes it (and as a result, KM and Library) more valuable.

Below are the thoughts I wrote down to discuss six questions. These questions were raised at the ARK KM meeting earlier this year and, although the audience was substantially different, I thought it would be a good reference point to cover what is expected of us, and how we can contribute to the operations of the firm in unexpected ways. Thanks to Sean Luman for stepping in and co-presenting with me after Toby suddenly had a conflict.

[Note: Click here to see the Prezi that went along with the presentation.]

My first time to see a “Prezi.” See what you think about it. Comments?

BTW, I thought the frame with:

Lawyers like to think all work is “custom” work.

Clients tend to think most work is “repetitive” (but lawyers are still charging as if it is custom work).

Was quite amusing. I suspect the truth lies somewhere in between those two positions.

I think topic maps can help to integrate not only traditional information sources with case analysis, pleadings, discovery, but non-traditional resources as well. News sources for example. Government agency rulings, opinions, treatment of similarly situated parties. The current problem being that an attorney has to search separate resources for all of those sources of information and more.

Skillful collation of diverse information sources using topic maps would allow attorneys to bill at full rate for the exercise of their knowledge and analytical skills, while eliminating charges for largely rote work of ferreting out resources to be analyzed.

For example, a patent topic map in a particular area could deliver to a patent attorney just those portions of patents that are relevant for their review, not all patents in a searched area or even the full patents. And the paths taken in the analysis of one patent could be available to other attorneys in the same firm, enabling a more efficient response to later queries in a particular area (think of it as legal bread crumbs).

GraphLab: Workshop on Big Learning

Filed under: Algorithms,BigData,Machine Learning — Patrick Durusau @ 7:06 pm

GraphLab: Workshop on Big Learning

Monday, July 9, 2012

From the webpage:

The GraphLab workshop on large scale machine learning is a meeting place for both academia and industry to discuss upcoming challenges of large scale machine learning and solution methods. GraphLab is Carnegie Mellon’s large scale machine learning framework. The workshop will include demos and tutorials showcasing the next generation of the GraphLab framework, as well as lectures and demos from the top technology companies about their applied large scale machine learning solutions.

and

There is a related workshop on Algorithms for Modern Massive Datasets at Stanford, immediately after the GraphLab workshop.

If you are going to be in the Bay area, definitely a good way to start the week!

Open Government Data

Filed under: Data,Government Data,Open Data — Patrick Durusau @ 7:06 pm

Open Government Data by Joshua Tauberer.

From the website:

This book is the culmination of several years of thinking about the principles behind the open government data movement in the United States. In the pages within, I frame the movement as the application of Big Data to civics. Topics include principles, uses for transparency and civic engagement, a brief legal history, data quality, civic hacking, and paradoxes in transparency.

Joshua’s book can be ordered in hard copy or ebook, or viewed online for free.

You may find this title useful in discussions of open government data.

The Public Library of Law

Filed under: Law - Sources,Legal Informatics — Patrick Durusau @ 7:06 pm

The Public Library of Law

From the website:

Searching the Web is easy. Why should searching the law be any different? That’s why Fastcase has created the Public Library of Law — to make it easy to find the law online. PLoL is one of the largest free law libraries in the world, because we assemble law available for free scattered across many different sites — all in one place. PLoL is the best starting place to find law on the Web.

Well…, yes, I suppose “[s]earching the Web is easy,” but getting useful results is not.

Getting useful results from searching the law is even more difficult. Far more difficult.

The Federal Rules of Civil Procedure (US Federal Courts) run just under one hundred pages (one hundred sixty-eight (168) with forms). For law students there is The Law of Federal Courts, 7th Ed. by Charles A. Wright and Mary Kay Kane, which is ten (10) times that long and a blizzard of case citations and detailed analysis. Professionals use Federal Practice and Procedure, Wright & Miller, which covers criminal and other aspects of federal procedure, at thirty-one volumes. A professional would also be using other resources of equal depth to Wright & Miller on relevant legal issues.

I fully support what the Public Library of Law is trying to do. But I want you to be aware that useful legal research requires more than finding language you like or happen to agree with. Perhaps more than most places, in law words don’t always mean what you think they mean. And they vary from place to place more than you would expect.

Deeply fascinating reading awaits you but if you need legal advice, there is no substitute for consulting someone with professional training who reads the law everyday.

I have included the PLoL here because I think topic maps have a tremendous potential for legal research and practice.

Imagine:

  • Mapping case analysis, law, to pleadings, depositions, etc.
  • Mapping pleadings, motions, etc. to particular trial judges.
  • Mapping appeals decisions to particular trial judges and attorneys.
  • Mapping appeals decisions to detailed case facts.
  • Mapping appeals decisions to judges and attorneys.
  • Recording paths through depositions to other evidence.
  • Mapping different terminologies between witnesses.
  • Mapping portions of pleadings, discovery, etc., to specific facts, courts.
  • Harvesting anecdotal stories to create internal resources.
  • Or creating a service that offers one or more of these services to attorneys.