Name

README - Read me file for the Co-occurrence-Matrix-0.02 package


Synopsis

The code in this package builds a word-by-word co-occurrence matrix. To set up the matrix, we first need to count the bigrams of the text. Satanjeev Banerjee and Ted Pedersen wrote the original count.pl program. The problem with the original count.pl is that, as the window size and the text grow, memory cannot hold all of the hash entries. For example, a computer with 8GB of memory can hold about 40 million hash entries. If the text has more than 40 million bigrams, the system will probably run out of memory while counting them. Different computers have different amounts of memory. When the text is huge and the window size is large, the bigram frequencies have to be calculated by splitting the duplicate bigrams file into smaller files, sorting each of those files, and merging them to get the final unique bigrams file. The co-occurrence matrix is then built from the unique bigrams file.
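
The split, sort and merge idea described above can be sketched as follows. This is an illustrative Python sketch, not the Perl code shipped in this package. The function name count_bigrams_external, the chunk_size parameter, the chunk.N file names and the tab-separated chunk format are all assumptions made for the example; the real file formats used by huge-sort.pl and huge-merge.pl may differ.

```python
import heapq
import itertools
import os
import tempfile
from collections import Counter


def count_bigrams_external(bigrams, chunk_size, work_dir):
    """Count duplicate bigrams without holding all of them in memory.

    bigrams: iterable of (word1, word2) pairs, duplicates included.
    chunk_size: hypothetical limit on the pairs held in memory at once.
    Yields ((word1, word2), frequency) in sorted bigram order.
    Assumes words contain no tab or newline characters.
    """
    chunk_paths = []
    source = iter(bigrams)
    while True:
        chunk = list(itertools.islice(source, chunk_size))
        if not chunk:
            break
        # Split + sort: count one chunk in memory and write it out sorted.
        counts = Counter(chunk)
        path = os.path.join(work_dir, "chunk.%d" % len(chunk_paths))
        with open(path, "w", encoding="utf-8") as out:
            for (w1, w2), n in sorted(counts.items()):
                out.write("%s\t%s\t%d\n" % (w1, w2, n))
        chunk_paths.append(path)

    def read_chunk(path):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                w1, w2, n = line.rstrip("\n").split("\t")
                yield (w1, w2), int(n)

    # Merge: stream the sorted chunks together and sum equal bigrams.
    merged = heapq.merge(*(read_chunk(p) for p in chunk_paths))
    for key, group in itertools.groupby(merged, key=lambda item: item[0]):
        yield key, sum(n for _, n in group)


if __name__ == "__main__":
    pairs = [("a", "b"), ("b", "c"), ("a", "b"), ("c", "a")]
    with tempfile.TemporaryDirectory() as work_dir:
        for bigram, freq in count_bigrams_external(pairs, 2, work_dir):
            print(bigram, freq)
```

Only one chunk of bigrams is counted in memory at a time, and the merge step reads the sorted chunk files line by line, so peak memory depends on chunk_size rather than on the size of the text.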


Author

Ying Liu, liux0395 at umn dot edu


run.sh Usage

run.sh ties the scripts together.

usage: ./run.sh


run.sh Description

run.sh ties together the following scripts:

1. countX.pl: Write out duplicate bigrams into one file

2. huge-sort.pl: Sort each duplicate bigrams file

3. huge-merge.pl: Merge each sorted bigrams file into one unique bigrams file

4. cooccurrence.pl: Build the co-occurrence matrix and the index file

5. huge-vector.pl: Calculate the vector from the co-occurrence matrix and the index file

Five files are generated in the result directory after the process:

1. bigrams.txt: the duplicate bigrams file

2. ./output/merge.X: the final unique bigrams file

3. ./output/index: index file for each unique word of the text

4. ./output/matrix: co-occurrence matrix

5. ./output/testVector: test vector output
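
As a rough illustration of how an index and a matrix relate to the unique bigrams, here is a minimal Python sketch. It is not the Perl code in cooccurrence.pl. The function name build_cooccurrence and the in-memory dictionary layout are assumptions for the example; the on-disk formats of ./output/index and ./output/matrix are not described in this README and may differ.

```python
from collections import defaultdict


def build_cooccurrence(unique_bigrams):
    """Build a word index and a sparse co-occurrence matrix.

    unique_bigrams: iterable of ((word1, word2), frequency), where each
    bigram appears exactly once.
    Returns (index, matrix): index maps each word to an integer id, and
    matrix[row_id][col_id] holds the bigram frequency.
    """
    index = {}
    matrix = defaultdict(dict)
    for (w1, w2), freq in unique_bigrams:
        # Assign ids in order of first appearance.
        row = index.setdefault(w1, len(index))
        col = index.setdefault(w2, len(index))
        matrix[row][col] = freq
    return index, dict(matrix)


if __name__ == "__main__":
    counts = [(("a", "b"), 3), (("b", "c"), 2), (("c", "a"), 1)]
    index, matrix = build_cooccurrence(counts)
    print(index)
    print(matrix)
```

Only non-zero cells are stored, which matters because most word pairs never co-occur in a large vocabulary.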


Package Description

1. countX.pl:

A modified version of the original count.pl that writes the duplicate bigrams to a file. When the text is huge, the hash in the original count.pl cannot hold all the ngrams it encounters while it keeps counting ngram frequencies. countX.pl resolves this problem by printing out every ngram it finds.

usage: perl countX.pl --set_freq_combo comboX outputFile inputFile (inputFile is the output file of parseXML.pl)

2. huge-sort.pl:

Obtain the unique bigrams of the input file and sort them in alphabetical order.

usage: perl huge-sort.pl sourceFolder

3. huge-merge.pl:

Merge the sorted bigrams files into one unique bigrams file.

usage: perl huge-merge.pl sourceFolder

4. cooccurrence.pl:

Build the co-occurrence matrix from the unique bigrams file.

usage: perl cooccurrence.pl index matrix source

5. huge-vector.pl:

Calculate the vector from the co-occurrence matrix and the index file.

usage: perl huge-vector.pl index matrix vector

6. README.txt:

This file.

7. 10000.txt:

Contains 10000 documents from the xin_eng Gigaword English XML file.
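
The approach described for countX.pl in item 1, printing every bigram as it is found instead of counting it in a hash, can be sketched as below. This is an illustrative Python sketch, not the Perl code in countX.pl. The function name emit_bigrams is hypothetical, and the window semantics (a window of 2 pairs adjacent words only, larger windows also pair words with intervening words between them) are an assumption for the example.

```python
from collections import deque


def emit_bigrams(tokens, window=2):
    """Yield every (earlier_word, later_word) pair inside the window.

    tokens: iterable of words in text order.
    window: assumed window size; 2 means adjacent words only.
    Nothing is counted here, so memory use stays constant.
    """
    # Keep only the last window-1 words that can still pair with the next one.
    recent = deque(maxlen=window - 1)
    for word in tokens:
        for earlier in recent:
            yield earlier, word
        recent.append(word)


if __name__ == "__main__":
    for w1, w2 in emit_bigrams("the cat sat down".split(), window=3):
        print(w1, w2)
```

Because each pair is emitted immediately, the duplicates can be written straight to a file and counted later by the sort and merge steps.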


Copyright

Copyright (C) 2009, Ying Liu.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.