Parallel Data Generation Framework
From the webpage:
The Parallel Data Generation Framework (PDGF) is a generic data generator for database benchmarking. Its development started at the University of Passau at the group of Prof. Dr. Harald Kosch.
PDGF was designed to take advantage of today’s multi-core processors and large clusters of computers to generate large amounts of synthetic benchmark data very fast. PDGF uses a fully computational approach and is a pure Java implementation which makes it very portable.
I mention PDGF to ask: are you aware of methods for generating unstructured text with known characteristics, such as the number of entities and how each entity is represented in the dataset?
A “natural” dataset, such as blog posts or emails, can be probed to determine its semantic characteristics, but I am interested in generating a dataset whose semantic characteristics are known in advance.
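To make the question concrete, here is a minimal sketch of the kind of thing I have in mind. It is not PDGF and not an established method, only a toy template filler in Python. The entity inventory, the templates, the filler sentences, and the function name generate are all hypothetical. The point is that the generator returns a ground-truth record of every entity mention it planted, so the number of entities and their surface forms are known by construction.

```python
import random

# Hypothetical entity inventory: each entity ID maps to several surface forms.
ENTITIES = {
    "E1": ["International Business Machines", "IBM", "Big Blue"],
    "E2": ["University of Passau", "Uni Passau"],
    "E3": ["Harald Kosch", "Prof. Kosch"],
}

# Hypothetical templates with one mention slot, plus filler with no mentions.
TEMPLATES = ["Yesterday {m} made an announcement.", "I read a report about {m}."]
FILLER = ["Nothing else happened that day.", "The weather was unremarkable."]


def generate(num_docs, mentions_per_doc, seed):
    """Return (documents, ground_truth) with a known number of entity mentions."""
    rng = random.Random(seed)  # a fixed seed makes the corpus reproducible
    docs, truth = [], []
    for doc_id in range(num_docs):
        sentences = []
        for _ in range(mentions_per_doc):
            entity_id = rng.choice(sorted(ENTITIES))
            surface = rng.choice(ENTITIES[entity_id])
            sentences.append(rng.choice(TEMPLATES).format(m=surface))
            sentences.append(rng.choice(FILLER))
            # Record which entity was planted, where, and in which form.
            truth.append((doc_id, entity_id, surface))
        docs.append(" ".join(sentences))
    return docs, truth


if __name__ == "__main__":
    docs, truth = generate(num_docs=3, mentions_per_doc=2, seed=42)
    for doc in docs:
        print(doc)
    print(len(truth), "planted mentions")
```

Text produced this way is obviously far from natural, which is exactly the gap I am asking about: something with this kind of ground truth but with more realistic prose.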
Thoughts?
I first saw this in a tweet by Stefano Bertolo.