Rolling out a “How-To” Software Series
A long preface that ends with a list of posts on “how to” use NodeXL.
Looks very good!
Enjoy!
Rolling out a “How-To” Software Series
A long preface that ends with a list of posts on “how to” use NodeXL.
Looks very good!
Enjoy!
Finding Shakespeare’s Favourite Words With Data Explorer by Chris Webb.
From the post:
The more I play with Data Explorer, the more I think my initial assessment of it as a self-service ETL tool was wrong. As Jamie pointed out recently, it’s really the M language with a GUI on top of it and the GUI itself, while good, doesn’t begin to expose the power of the underlying language: I’d urge you to take a look at the Formula Language Specification and Library Specification documents which can be downloaded from here to see for yourself. So while it can certainly be used for self-service ETL it can do much, much more than that…
In this post I’ll show you an example of what Data Explorer can do once you go beyond the UI. Starting off with a text file containing the complete works of William Shakespeare (which can be downloaded from here – it’s strange to think that it’s just a 5.3 MB text file) I’m going to find the top 100 most frequently used words and display them in a table in Excel.
If Data Explorer is a GUI on top of M (outdated but a point of origin), it goes up in importance.
From the M link:
The Microsoft code name “M” Modeling Language, hereinafter referred to as M, is a language for modeling domains using text. A domain is any collection of related concepts or objects. Modeling domain consists of selecting certain characteristics to include in the model and implicitly excluding others deemed irrelevant. Modeling using text has some advantages and disadvantages over modeling using other media such as diagrams or clay. A goal of the M language is to exploit these advantages and mitigate the disadvantages.
A key advantage of modeling in text is ease with which both computers and humans can store and process text. Text is often the most natural way to represent information for presentation and editing by people. However, the ability to extract that information for use by software has been an arcane art practiced only by the most advanced developers. The language feature of M enables information to be represented in a textual form that is tuned for both the problem domain and the target audience. The M language provides simple constructs for describing the shape of a textual language – that shape includes the input syntax as well as the structure and contents of the underlying information. To that end, M acts as both a schema language that can validate that textual input conforms to a given language as well as a transformation language that projects textual input into data structures that are amenable to further processing or storage.
I try to not run examples using Shakespeare. I get distracted by the elegance of the text, which isn’t the point of the exercise.
Top 10 Formulas for Aspiring Analysts by Purna “Chandoo” Duggirala.
From the post:
Few weeks ago, someone asked me “What are the top 10 formulas?” That got me thinking.
While each of us have our own list of favorite, most frequently used formulas, there is no standard list of top 10 formulas for everyone. So, today let me attempt that.
If you want to become a data or business analyst then you must develop good understanding of Excel formulas & become fluent in them.
A good analyst should be familiar with below 10 formulas to begin with.
A reminder that not all data analysis starts with the most complex chain of transformations you can imagine.
Sometimes you need to explore and then roll out the heavy weapons.
Update: TabLinker & UnTabLinker
From the post:
TabLinker, introduced in an earlier post, is a spreadsheet to RDF converter. It takes Excel/CSV files as input, and produces enriched RDF graphs with cell contents, properties and annotations using the DataCube and Open Annotation vocabularies.
TabLinker interprets spreadsheets based on hand-made markup using a small set of predefined styles (e.g. it needs to know what the header cells are). Work package 6 is currently investigating whether and how we can perform this step automatically.
Features:
- Raw, model-agnostic conversion from spreadsheets to RDF
- Interactive spreadsheet marking within Excel
- Automatic annotation recognition and export with OA
- Round-trip conversion: revive the original spreadsheet files from the produced RDF (UnTabLinker)
Even with conversion tools, the question has to be asked:
What was gained by the conversion? Yes, yes the data is now an RDF graph but what can I do now that I could not do before?
With the caveat that it has to be something I want to do.
From the post:
Bringing Hadoop to Windows and the two-year development of Hadoop 2.0 are two of the more exciting developments brought up by Hortonworks’s Cofounder and CTO, Eric Baldeschwieler, in a talk before a panel at the Cloud 2012 Conference in Honolulu.
(video omitted)
The panel, which was also attended by Baldeschwieler’s Cloudera counterpart Amr Awadallah, focused on insights into the big data world, a subject Baldeschwieler tackled almost entirely with Hadoop. The eighteen-minute discussion also featured a brief history of Hadoop’s rise to prominence, improvements to be made to Hadoop, and a few tips to enterprising researchers wishing to contribute to Hadoop.
“Bringing Hadoop to Windows,” says Baldeschwieler “turns out to be a very exciting initiative because there are a huge number of users in Windows operating system.” In particular, the Excel spreadsheet program is a popular one for business analysts, something analysts would like to see integrated with Hadoop’s database. That will not be possible until, as Baldeschwieler notes, Windows is integrated into Hadoop later this year, a move that will also considerably expand Hadoop’s reach.
However, that announcement pales in comparison to the possibilities provided by the impending Hadoop 2.0. “Hadoop 2.0 is a pretty major re-write of Hadoop that’s been in the works for two years. It’s now in usable alpha form…The real focus in Hadoop 2.0 is scale and opening it up for more innovation.” Baldeschwieler notes that Hadoop’s rise has been result of what he calls “a happy accident” where it was being developed by his Yahoo team for a specific use case: classifying, sorting, and indexing each of the URLs that were under Yahoo’s scope.
Integration of Excel and Hadoop?
Is that going to be echoes of Unix – The Hole Hawg?
From the post:
Mirai Solutions GmbH (http://www.mirai-solutions.com) is very pleased to announce the release of XLConnect 0.2-0, which can be found at CRAN.
As one of the updates, XLConnect has moved to the newest release of Apache POI: 3.8. Also, the lazy evaluation issues with S4 generics are now fixed: generic methods now fully expand the argument list in order to have the arguments immediately evaluated.
Furthermore, we have added an XLConnect.R script file to the top level library directory, which contains all code examples presented in the vignette, so that it’s easier to reuse the code.
From an earlier description of XLConnect:
XLConnect is a comprehensive and platform-independent R package for manipulating Microsoft Excel files from within R. XLConnect differs from other related R packages in that it is completely cross-platform and as such runs under Windows, Unix/Linux and Mac (32- and 64-bit). Moreover, it does not require any installation of Microsoft Excel or any other special drivers to be able to read & write Excel files. The only requirement is a recent version of a Java Runtime Environment (JRE). Also, XLConnect can deal with the old *.xls (BIFF) and the new *.xlsx (Office Open XML) file formats. Under the hood, XLConnect uses Apache POI (http://poi.apache.org) – a Java API to manipulate Microsoft Office documents. (From XLConnect – A platform-independent interface to Excel
If you work with data in a business environment, you are going to encounter Excel files. (Assuming you are not in a barter economy counting animal skins and dried fish.)
And customers are going to want you to return Excel files to them. (Yes, yes, topic maps would be a much better delivery format. But if the choice is Excel files, you get paid, topic maps files, you don’t get paid, which one would you do? That’s what I thought.)
A package to consider if you need to manipulate Excel files from within R.
Chris Webb writes up a quick BI solution in Excel 2013:
Building a Simple BI Solution in Excel 2013, Part 1
and
Building a Simple BI Solution in Excel 2013, Part 2
In the process Chris uncovers some bugs and disappointments, but on the whole the application works.
I mention it for a couple of reasons.
If you recall, something like 75% of the BI market is held by Excel. I don’t expect that to change any time soon.
What do you think happens when “self-service” BI applications are created by users? Other than becoming the default applications for offices and groups in organizations?
Will different users are going to make different choices with their Excel BI applications?
Will users with different Excel BI applications resort to knives, if not guns, to avoid changing their Excel BI applications?
Excel in its many versions leads to varying and inconsistent “self-service” applications in 75% of the BI marketplace.
Is it just me or does that sound like an opportunity for topic maps to you?
Data Mining In Excel: Lecture Notes and Cases (2005) by Galit Shmueli, Nitin R. Patel, and Peter C. Bruce.
From the introduction:
This book arose out of a data mining course at MIT’s Sloan School of Management. Preparation for the course revealed that there are a number of excellent books on the business context of data mining, but their coverage of the statistical and machine-learning algorithms that underlie data mining is not sufficiently detailed to provide a practical guide if the instructor’s goal is to equip students with the skills and tools to implement those algorithms. On the other hand, there are also a number of more technical books about data mining algorithms, but these are aimed at the statistical researcher, or more advanced graduate student, and do not provide the case-oriented business focus that is successful in teaching business students.
Hence, this book is intended for the business student (and practitioner) of data mining techniques, and its goal is threefold:
- To provide both a theoretical and practical understanding of the key methods of classification, prediction, reduction and exploration that are at the heart of data mining;
- To provide a business decision-making context for these methods;
- Using real business cases, to illustrate the application and interpretation of these methods.
An important feature of this book is the use of Excel, an environment familiar to business analysts. All required data mining algorithms (plus illustrative datasets) are provided in an Excel add-in, XLMiner. XLMiner offers a variety of data mining tools: neural nets, classification and regression trees, k-nearest neighbor classification, naive Bayes, logistic regression, multiple linear regression, and discriminant analysis, all for predictive modeling. It provides for automatic partitioning of data into training, validation and test samples, and for the deployment of the model to new data. It also offers association rules, principal components analysis, k-means clustering and hierarchical clustering, as well as visualization tools, and data handling utilities. With its short learning curve, affordable price, and reliance on the familiar Excel platform, it is an ideal companion to a book on data mining for the business student.
Some what dated but remember there are lots of older copies of MS Office around. Not an inconsiderable market if you start to write something on using Excel to produce topic maps. Write for the latest version but I would have a version keyed to earlier versions of Excel as well.
I first saw this at KDNuggets.
Data Mining In Excel: Lecture Notes and Cases by Yanchang Zhao.
Table of contents (270 page book)
You knew that someday all those Excel files would be useful!
Well, today may be the day!
A bit dated, 2005, but should be a good starting place.
If you are interested in learning data mining in Excel cold, try comparing the then capacities of Excel to the current version of Excel and updating the text/examples.
Best way to learn it is to update and then teach it to others.
50 Open Source Replacements for Proprietary Business Intelligence Software by Cynthia Harvey.
From the post:
In a recent Gartner survey, CIOs picked business intelligence and analytics as their top technology priority for 2012. The market research firm predicts that enterprises will spend more than $12 billion on business intelligence (BI), analytics and performance management software this year alone.
As the market for business intelligence solutions continues to grow, the open source community is responding with a growing number of applications designed to help companies store and analyze key business data. In fact, many of the best tools in the field are available under an open source license. And enterprises that need commercial support or other services will find many options available.
This month, we’ve put together a list of 50 of the top open source business intelligence tools that can replace proprietary solutions. It includes complete business intelligence platforms, data warehouses and databases, data mining and reporting tools, ERP suites with built-in BI capabilities and even spreadsheets. If we’ve overlooked any tools that you feel should be on the list, please feel free to note them in the comments section below.
A very useful listing of “replacements” for proprietary software in part because it includes links to the software to be replaced.
You will find it helpful in identifying software packages with common goals but diverse outputs, grist for topic map mills.
I tried to find a one-page display (print usually works) but you will have to endure the advertising clutter to see the listing.
PS: Remember that MS Excel seventy-five (75%) percent of the BI market. Improve upon/use an MS Excel result, you are closer to a commercially viable product. (BI’s Dirty Secrets – Why Business People are Addicted to Spreadsheets)
NodeXL: Network Overview, Discovery and Exploration for Excel
From the webpage:
NodeXL is a free, open-source template for Microsoft® Excel® 2007 and 2010 that makes it easy to explore network graphs. With NodeXL, you can enter a network edge list in a worksheet, click a button and see your graph, all in the familiar environment of the Excel window.
NodeXL Features
- Flexible Import and Export Import and export graphs in GraphML, Pajek, UCINet, and matrix formats.
- Direct Connections to Social Networks Import social networks directly from Twitter, YouTube, Flickr and email, or use one of several available plug-ins to get networks from Facebook, Exchange and WWW hyperlinks.
- Zoom and Scale Zoom into areas of interest, and scale the graph’s vertices to reduce clutter.
- Flexible Layout Use one of several “force-directed” algorithms to lay out the graph, or drag vertices around with the mouse. Have NodeXL move all of the graph’s smaller connected components to the bottom of the graph to focus on what’s important.
- Easily Adjusted Appearance Set the color, shape, size, label, and opacity of individual vertices by filling in worksheet cells, or let NodeXL do it for you based on vertex attributes such as degree, betweenness centrality or PageRank.
- Dynamic Filtering Instantly hide vertices and edges using a set of sliders—hide all vertices with degree less than five, for example.
- Powerful Vertex Grouping Group the graph’s vertices by common attributes, or have NodeXL analyze their connectedness and automatically group them into clusters. Make groups distinguishable using shapes and color, collapse them with a few clicks, or put each group in its own box within the graph. “Bundle” intergroup edges to make them more manageable.
- Graph Metric Calculations Easily calculate degree, betweenness centrality, closeness centrality, eigenvector centrality, PageRank, clustering coefficient, graph density and more.
- Task Automation Perform a set of repeated tasks with a single click.
Homepage for NodeXL, which uses Excel as the framework for display and exploration of graphs.
There is something to be said about software that ties itself to other successful software. I think that is “increased chances of success.” Don’t you?
Reading Excel data is easy with JGR and XLConnect
From the post:
Despite the fact that Excel is the most widespread application for data manipulation and (perhaps) analysis, R’s support for the xls and xlsx file formats has left a lot to be desired. Fortunately, the XLConnect package has been created to fill this void, and now JGR 1.7-8 includes integration with XLConnect package to load .xls and .xlsx documents into R.
For JGR, see: http://rforge.net/JGR/
JavaScript Console and Excel Coming to Hadoop
Alex Popescu (myNoSQL) has pointers to news of Hadoop on Windows Azure. Opens Hadoop up to Javascript developers and Excel/PowerPivot users.
Alex captures the winning strategy for new technologies when he says:
Think of integration with familiar tools and frameworks as a huge adoption accelerator.
What would it look like to add configurable merging on PowerPivot? (I may have to get a copy of MS Office 2010.)