Where can I get a diverse set of sample text? When choosing a large text, keep in mind that the resulting frequencies will be biased toward the type of text: analyzing addresses, for example, will give different results than analyzing newspaper stories. You can convert a PDF to text using File --> Save As Text in the Adobe PDF reader.

Sample insurance portfolio (downloadable .csv file): the sample insurance file contains 36,634 records in Florida for 2012 from a sample company that implemented an aggressive growth plan in 2012. There are total insured value (TIV) columns containing TIV from 2011 and 2012, so this dataset is great for testing out the comparison feature.
This tool performs reservoir sampling (Vitter, 'Random sampling with a reservoir'; cf. http://dx.doi.org/10.1145/3147.3165 and also: http://en.wikipedia.org/wiki/Reservoir_sampling) on very large text files that are delimited by newline characters. Sampling can be done with or without replacement. The approach used in this application reduces the typical memory usage issue with reservoir sampling by storing a pool of byte offsets to the start of each line, instead of the line elements themselves, thus allowing much larger sample sizes.
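To make the offset-pool idea concrete, here is a minimal Python sketch (an illustration only, not the tool's actual C implementation) of classic reservoir sampling in which the reservoir holds byte offsets of line starts rather than the lines themselves:

```python
import random

def sample_line_offsets(path, k):
    """Reservoir-sample k line-start byte offsets from a newline-delimited file."""
    reservoir = []   # holds byte offsets only, never the line contents
    offset = 0       # byte offset of the line about to be read
    with open(path, "rb") as fh:
        for i, line in enumerate(fh):
            if i < k:
                reservoir.append(offset)
            else:
                # Keep this line with probability k / (i + 1) (Algorithm R).
                j = random.randint(0, i)
                if j < k:
                    reservoir[j] = offset
            offset += len(line)
    return reservoir
```

A second pass can then seek to each stored offset and emit the corresponding line; that pass is sketched further below.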
In its current form, this application offers a few advantages over common `shuf`-based approaches:

- On small k, it performs roughly 2.25-2.75x faster than `shuf` in informal tests on OS X and Linux hosts.
- It uses much less memory than the usual reservoir sampling approach, which stores a pool of sampled elements; instead, `sample` stores the start positions of sampled lines (8 bytes per line).
- Using less memory gives `sample` an advantage over `shuf` for whole-genome scale files, helping avoid `shuf: memory exhausted` errors. For instance, at 8 bytes per position, a 2 GB allocation would allow a sample size of up to ~268M random elements (sampling without replacement).
The `sample` tool stores a pool of line positions and makes two passes through the input file. One pass generates the sample of random positions, using a Mersenne Twister to generate uniformly random values, while the second pass uses those positions to print the sample to standard output. To minimize the expense of this second pass, we use `mmap` routines to gain random access to data in the (regular) input file on both passes.
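As a rough illustration of that second pass (again a Python sketch, assuming the first pass produced a list of line-start offsets; this is not the tool's actual C code), memory-mapping the file lets each sampled line be sliced out directly:

```python
import mmap
import sys

def print_sampled_lines(path, offsets):
    """Second pass: memory-map the input and write each sampled line to stdout."""
    with open(path, "rb") as fh, mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for off in offsets:
            end = mm.find(b"\n", off)
            if end == -1:        # final line may lack a trailing newline
                end = len(mm)
            sys.stdout.buffer.write(mm[off:end] + b"\n")
```

Because only the 8-byte offsets are kept between passes, memory use stays independent of line length.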
The benefit that `mmap` provided was significant. For comparison purposes, we also added a `--cstdio` option to test the performance of standard C I/O routines (`fseek()`, etc.); predictably, this performed worse than the `mmap`-based approach in all tests, but its timing results were about identical to `gshuf` on OS X and still averaged a 1.5x improvement over `shuf` under Linux.
The `sample` tool can be used to sample from any text file delimited by newline characters (BED, SAM, VCF, etc.).
By adding the `--preserve-order` option, the output sample preserves the input order. For example, when sampling from an input BED file that has been sorted by BEDOPS `sort-bed` (which applies a lexicographical sort on chromosome names and a numerical sort on start and stop coordinates), the sample will have the same ordering applied, with a relatively small O(k log k) penalty for a sample of size k.
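Continuing the hypothetical Python sketches above (reusing their helper names), preserving input order amounts to sorting the k sampled offsets before the second pass; that sort is where the O(k log k) term comes from:

```python
def sample_preserving_order(path, k):
    offsets = sample_line_offsets(path, k)   # reservoir order is arbitrary
    offsets.sort()                           # O(k log k): restore input order
    print_sampled_lines(path, offsets)       # second pass emits lines in file order
```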
By omitting the sample size parameter, the `sample` tool can shuffle the entire file. This can be used to shuffle files that `shuf` has trouble with; however, it currently runs slower than `shuf` in the cases where `shuf` can be used. When possible, we recommend using `shuf` to shuffle an entire file, or specifying the sample size (up to the line count, if known ahead of time).
One downside at this time is that `sample` does not process a standard input stream; the input must be a regular file.
A text file containing over 466k English words.
While searching for a list of English words (for an auto-complete tutorial) I found http://stackoverflow.com/questions/2213607/how-to-get-english-language-word-database, which refers to http://www.infochimps.com/datasets/word-list-350000-simple-english-words-excel-readable (archived).
No idea why infochimps put the word list inside an Excel (.xls) file.
I pulled the words out into a simple newline-delimited text file, which is more useful when building apps or importing into databases, etc.
Copyright still belongs to them.
Files you may be interested in:
- words.txt contains all words.
- words_alpha.txt contains only [[:alpha:]] words (words that have only letters, no numbers or symbols). If you want a quick solution, choose this.
- words_dictionary.json contains all the words from words_alpha.txt in JSON format. If you are using Python, you can easily load this file and use it as a dictionary for faster lookups. Every word is assigned the value 1 in the dictionary. See read_english_dictionary.py for example usage, or the brief sketch below.
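For instance, a minimal way to load the JSON file in Python (a sketch along the lines of what read_english_dictionary.py is described as doing; the file is assumed to sit in the working directory) might be:

```python
import json

# Load words_dictionary.json: a JSON object mapping each word to 1,
# so membership tests become O(1) dictionary lookups.
with open("words_dictionary.json") as fh:
    english_words = json.load(fh)

print(len(english_words))            # number of words loaded
print("sample" in english_words)     # fast membership check
```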