Similar Pages for Wikipedia – Lateral – Indexing Practices

Similar Pages for Wikipedia (Chrome extension)

I started looking at this software with a mis-impression that I hope you can avoid.

I installed the extension and as advertised, if I am on a Wikipedia page, it recommends “similar” Wikipedia pages.

Unless I’m billing time, plowing through page after page of tangentially related material isn’t my idea of a good time.

Ah, but I confused “document” with “page.”

I discovered that error while reading Adding Documents at Lateral, which gives the following example:

lateral-add-doc-example

Ah! So “document” means as much or as little text as I choose to use when I add the document.

Which means if I were creating a document store of graph papers, I would capture only the new material and not the inevitable a “graph consists of nodes and edges….”

There are pre-populatd data sets, News 350,000+ news and blog articles, updated every 15 mins; arXiv 1M+ papers (all), updated daily; PubMed 6M+ medical journals from before July 2014; SEC 6,000+ yearly financial reports / 10-K filings from 2014; Wikipedia 463,000 pages which had 20+ page views in 2013.

I suspect the granularity on the pre-populated data sets is “document” in the usual sense size.

Glad to see the option to define a “document” to be an arbitrary span of text.

I don’t need to find more “documents” (in the usual sense) but more relevant snippets that are directly on point.

Hmmm, perhaps indexing at the level of paragraphs instead of documents (usual sense)?

Which makes me wonder why we index at the level of documents (usual sense) anyway? Is it simply tradition from when indexes were prepared by human indexers? And indexes were limited by physical constraints?

Comments are closed.