Tokenising the visible English text of Common Crawl, by Mat Kelcey.
From the post:
Common Crawl is a publicly available 30 TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenise the visible text of the web pages in this dataset. All the code to do this is on GitHub.
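The core of the task the quoted post describes, extracting the visible text of a page and splitting it into tokens, can be sketched in a few lines of stdlib Python. This is only an illustrative sketch, not Kelcey's actual code (which is on his GitHub); the class and function names here are invented, and the tokenisation scheme (lowercase alphabetic runs) is a deliberately crude placeholder.

```python
import re
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside a script/style element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

def tokenise(html: str) -> list[str]:
    """Return lowercase alphabetic tokens from a page's visible text."""
    parser = VisibleTextParser()
    parser.feed(html)
    text = " ".join(parser.chunks)
    return re.findall(r"[a-z]+", text.lower())

page = ("<html><head><style>body{color:red}</style></head>"
        "<body><p>Hello, Common Crawl!</p>"
        "<script>var x = 1;</script></body></html>")
print(tokenise(page))  # → ['hello', 'common', 'crawl']
```

Running this over 30 TB of crawl data is, of course, where the "small project" stops being small: the original work distributed the job rather than parsing pages one at a time on a single machine.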
Well, 30 TB of data certainly sounds like a small project.
What small amount of data are you using for your next project?