README - Read me file for the Co-occurrence-Matrix-0.02 package


This package collects the code for building a word-by-word co-occurrence matrix. Building the matrix requires counting the bigrams of the text. Satanjeev Banerjee and Ted Pedersen wrote the original method. The problem with the original is that, as the window size and the text grow, memory cannot hold all the hash entries. For example, a computer with 8GB of memory can hold about 40 million hash entries. If the text has more than 40 million bigrams, the system will probably run out of memory while counting them, and the exact limit differs from computer to computer because memory sizes differ. When the text is huge and the window size is large, the bigram frequencies therefore have to be computed by splitting the duplicate bigrams into several files, sorting each file, and merging the sorted files into the final unique bigrams file. The co-occurrence matrix is then built from the unique bigrams file.


Ying Liu, liux0395 at umn dot edu

USAGE

One script ties the codes together.

usage: ./

Description

The script ties together the following steps:

1. Write out duplicate bigrams into one file

2. Sort each duplicated bigrams file

3. Merge each sorted bigrams file into one unique bigrams file

4. Build the co-occurrence matrix and the index file

5. Calculate the test vector from the co-occurrence matrix and the index file

Five files are generated in the result directory after the process:

1. bigrams.txt: the duplicate bigrams file

2. ./output/merge.X: the final unique bigrams file

3. ./output/index: index file for each unique word of the text

4. ./output/matrix: co-occurrence matrix

5. ./output/testVector: test vector output

Package Description


A modified version of the original program that writes the duplicate bigrams to a file. When the text is huge, the hash in the original cannot hold every ngram it meets while it keeps counting ngram frequencies. This problem has been resolved by printing out every ngram as it is found.
usage: perl --set_freq_combo comboX outputFile inputFile(output file of
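The idea of printing every bigram instead of counting it in a hash can be sketched as follows. This is a minimal Python illustration, not the package's Perl code. The function name emit_bigrams and the window convention (a window of N pairs each token with the N-1 tokens before it) are assumptions made for the example.

```python
from collections import deque

def emit_bigrams(tokens, window=2):
    """Yield every (earlier, later) pair of tokens that fall within `window`.

    Assumed convention: window=2 pairs adjacent tokens only; a larger
    window also pairs tokens with up to window-2 tokens between them.
    Nothing is counted here, so memory use stays constant.
    """
    recent = deque(maxlen=window - 1)  # the previous window-1 tokens
    for tok in tokens:
        for earlier in recent:
            yield (earlier, tok)
        recent.append(tok)

# Duplicates are emitted as they occur; counting is left to later steps.
pairs = list(emit_bigrams("a b c a b".split(), window=3))
```

Because the generator only remembers the last few tokens, its memory use does not grow with the size of the text, which is the point of deferring the counting.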


Obtain the unique bigrams of each input file and sort them in alphabetical order.
usage: perl sourceFolder
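Sorting one file of duplicate bigrams and collapsing the duplicates might look like the sketch below. It is written in Python for illustration and assumes that each unique bigram keeps a count of how often it occurred. The function name sort_and_count and the in-memory list standing in for a file are hypothetical.

```python
from itertools import groupby

def sort_and_count(bigrams):
    """Return [(bigram, count), ...] sorted by bigram, one entry per distinct bigram."""
    ordered = sorted(bigrams)  # equal bigrams become adjacent after sorting
    return [(bigram, sum(1 for _ in group)) for bigram, group in groupby(ordered)]

# One chunk of duplicate bigrams, small enough to sort in memory.
chunk = [("b", "c"), ("a", "b"), ("a", "c"), ("a", "b")]
counted = sort_and_count(chunk)
```

Only one chunk has to fit in memory at a time, so the chunk size can be chosen to suit the machine.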


Merge the sorted bigrams files into one unique bigrams file.
usage: perl sourceFolder
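Merging several sorted bigram files can be done as a k-way merge that adds up the counts of equal bigrams. The Python sketch below shows the idea on in-memory lists. The function name merge_counted and the (bigram, count) record shape are assumptions for the example, not the package's file format.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_counted(*sorted_runs):
    """Merge runs of (bigram, count) records, each already sorted by bigram.

    Yields one (bigram, total_count) record per distinct bigram, in order.
    """
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    for bigram, group in groupby(merged, key=itemgetter(0)):
        yield bigram, sum(count for _, count in group)

run1 = [(("a", "b"), 2), (("b", "c"), 1)]
run2 = [(("a", "b"), 1), (("a", "c"), 3)]
unique = list(merge_counted(run1, run2))
```

heapq.merge reads its inputs lazily, so with file-backed iterators only one record per input file needs to be held in memory at once.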


Build the co-occurrence matrix from the unique bigrams file.
usage: perl index matrix source
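A sketch of how an index and a sparse matrix could be built from unique bigrams is given below, in Python for illustration. It assumes that each word gets a number in order of first appearance and that rows stand for the first word of a bigram and columns for the second. The function name build_matrix and the dictionary layout are hypothetical and may differ from the files the package writes.

```python
def build_matrix(counted_bigrams):
    """Build (index, matrix) from (bigram, count) records.

    index:  word -> number, assigned in order of first appearance
    matrix: row number -> {column number: co-occurrence count}
    """
    index = {}
    matrix = {}
    for (first, second), count in counted_bigrams:
        i = index.setdefault(first, len(index))
        j = index.setdefault(second, len(index))
        row = matrix.setdefault(i, {})
        row[j] = row.get(j, 0) + count  # only nonzero cells are stored
    return index, matrix

unique = [(("a", "b"), 3), (("a", "c"), 3), (("b", "c"), 1)]
index, matrix = build_matrix(unique)
```

Storing only the nonzero cells keeps the matrix small, since most word pairs never co-occur.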


Calculate the vector from the co-occurrence matrix and the index file.
usage: perl index matrix vector
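One plausible reading of this step is that the vector of a word is its row in the co-occurrence matrix, looked up through the index. The Python sketch below illustrates that reading only. The function name word_vector and the data layout are assumptions, and the package's own vector calculation may differ.

```python
def word_vector(word, index, matrix):
    """Return the dense co-occurrence row of `word`, one entry per indexed word."""
    row = matrix.get(index[word], {})  # words with no row get an all-zero vector
    return [row.get(col, 0) for col in range(len(index))]

# Hypothetical index and sparse matrix for a three-word vocabulary.
index = {"a": 0, "b": 1, "c": 2}
matrix = {0: {1: 3, 2: 3}, 1: {2: 1}}
vec_a = word_vector("a", index, matrix)
vec_c = word_vector("c", index, matrix)
```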

7. README.txt:

This file.

8. 10000.txt:

Contains 10000 documents from the xin_eng Gigaword English XML file.


Copyright (C) 2009, Ying Liu.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.