Name

README - Read me file for the Xapian-Bigram-0.03 package


Synopsis

The code in this package implements bigram counting over text. In order to set up a word-by-word co-occurrence matrix, we need to count the bigrams of the text. Satanjeev Banerjee and Ted Pedersen wrote the original count.pl program. The issue with the original count.pl is that, as the window size and the amount of text grow, memory cannot hold that many hash entries. For example, a computer with 8 GB of memory can hold about 40 million hash entries; if the text contains more than 40 million distinct bigrams, the system will probably run out of memory while counting them. Since different computers have different amounts of memory, when the text is huge and the window size is large, the bigram frequency calculation has to be finished in several passes.
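
For reference, the plain hash-based counting works roughly like the sketch below (a minimal illustration, not count.pl itself; the window size and the '<>' separator are placeholder choices). Every distinct bigram becomes a hash key, which is exactly what exhausts memory on a large corpus or a large window.

    #!/usr/bin/perl
    # Minimal sketch of hash-based bigram counting: every distinct bigram
    # becomes a hash key, so memory grows with the number of unique bigrams.
    # The input file name, window size and '<>' separator are placeholders.
    use strict;
    use warnings;

    my $window = 2;                      # window size: 2 = adjacent words only
    my %freq;                            # bigram => frequency

    open my $fh, '<', 'xin.txt' or die "xin.txt: $!";
    while (my $line = <$fh>) {
        my @tokens = split ' ', $line;
        for my $i (0 .. $#tokens - 1) {
            # pair the word at position $i with every later word in the window
            my $last = $i + $window - 1;
            $last = $#tokens if $last > $#tokens;
            for my $j ($i + 1 .. $last) {
                $freq{"$tokens[$i]<>$tokens[$j]"}++;
            }
        }
    }
    close $fh;

    # print the unique bigrams and their frequencies
    print "$_ $freq{$_}\n" for sort keys %freq;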

We use Xapian to solve this problem. Xapian is an open source search engine library (http://xapian.org/). By building a Xapian index database, we obtain each term's position information and count the bigram frequencies from it. Before you run the Perl scripts below, you need to install xapian-core and the Search::Xapian module (http://xapian.org/download).
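
As an illustration of how the position information gets into the index, a minimal Search::Xapian indexing sketch could look like the following (the database path and input file come from this package's defaults; buildXapian.pl's actual steps may differ):

    #!/usr/bin/perl
    # Sketch: build a positional Xapian index, one document per input line.
    use strict;
    use warnings;
    use Search::Xapian;

    my $db = Search::Xapian::WritableDatabase->new(
        'xapian_index', Search::Xapian::DB_CREATE_OR_OPEN());

    open my $fh, '<', 'xin.txt' or die "xin.txt: $!";
    while (my $line = <$fh>) {          # one document per line (parseXML.pl output)
        chomp $line;
        my $doc = Search::Xapian::Document->new();
        $doc->set_data($line);
        my $pos = 0;
        # record each term together with its position in the document
        $doc->add_posting($_, ++$pos) for split ' ', $line;
        $db->add_document($doc);
    }
    close $fh;
    $db->flush();                       # write the index to disk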


Author

Ying Liu, liux0395 at umn dot edu


bigramX.sh USAGE

bigramX.sh ties the scripts together.

usage: ./bigramX.sh result-dir source-dir

result-dir: the result folder name.

source-dir: the source text folder name.

ex: $ ./bigramX.sh test_xin ./gigaword/xin_eng


bigramX.sh Description

bigramX.sh ties together the following scripts:

1. parseXML.pl: Parse gigaword English XML file

2. countX.pl: Use countX.pl to find and write out the duplicated bigrams

3. scanBigrams.pl: Repeatedly scan the duplicated bigrams file and count the bigram frequencies

4. count.pl: Count the bigrams of the text.

5. compare.pl: Compare the results of count.pl and scanBigrams.pl

6. bigramX2.pl: Use bigramX2.pl to count the bigram frequencies

7. bigramX3.pl: Use bigramX3.pl to count the bigram frequencies

8. bigramX4.pl: Use bigramX4.pl to count the bigram frequencies

9. buildXapian.pl: Parse the gigaword English, build the Xapian index database

10. bigramX1.pl: Use bigramX1.pl to count the bigram frequencies

The following files and folders are generated in the result directory:

1. xin.txt: output file of parsing the gigaword English XML files; one document per line.

2. duplicate_bigrams: the duplicated bigrams of the xin.txt file, written by countX.pl

3. scan_bigrams: the unique bigrams and their frequencies

4. count_bigrams: the unique bigrams generated by count.pl

5. ./xapian_index folder: the Xapian index database built by buildXapian.pl

6. X1_bigrams: the unique bigrams and their frequencies from bigramX1.pl

7. X2_bigrams: the unique bigrams and their frequencies from bigramX2.pl

8. X3_bigrams: the unique bigrams and their frequencies from bigramX3.pl

9. X4_bigrams: the unique bigrams and their frequencies from bigramX4.pl

10. log.txt: log file recording the time usage and results of each step

11. folders 1 & 2: the Xapian index databases generated by bigramX2.pl


Package Description

1. count.pl:

Count the frequency of Ngrams in text.

2. countX.pl:

countX.pl modifies the original count.pl and writes the duplicate bigrams into a file. When the text is huge, the hash in the original count.pl cannot hold all the ngrams it encounters while it keeps counting their frequencies. countX.pl resolves this by printing out every ngram as soon as it is found instead of storing it in the hash.

usage: perl countX.pl --set_freq_combo comboX outputFile inputFile (the output file of parseXML.pl)
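
The key change can be sketched as follows (a simplified illustration for adjacent bigrams only; countX.pl itself keeps count.pl's full option handling):

    #!/usr/bin/perl
    # Sketch of the countX.pl idea: print every bigram to a file as soon as
    # it is seen instead of accumulating counts in a hash, so memory use
    # stays flat. The output holds duplicate bigrams, one per line, to be
    # counted later (e.g. by scanBigrams.pl). File names are placeholders.
    use strict;
    use warnings;

    open my $in,  '<', 'xin.txt'           or die "xin.txt: $!";
    open my $out, '>', 'duplicate_bigrams' or die "duplicate_bigrams: $!";
    while (my $line = <$in>) {
        my @tokens = split ' ', $line;
        for my $i (0 .. $#tokens - 1) {
            # write the bigram immediately rather than incrementing a hash entry
            print {$out} "$tokens[$i]<>$tokens[$i+1]\n";
        }
    }
    close $in;
    close $out;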

3. parseXML.pl:

Parses the gigaword English files into one document per line. Only the '%' character is removed while parsing.

usage: perl parseXML.pl sourceFolder outputFile
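
A rough sketch of this step, assuming the usual Gigaword markup where each document's body sits between <TEXT> and </TEXT> tags (the real parseXML.pl may handle the markup differently):

    #!/usr/bin/perl
    # Sketch: write the text of each Gigaword document on a single line,
    # stripping the SGML tags and removing the '%' character.
    use strict;
    use warnings;

    my ($source_dir, $out_file) = @ARGV;
    opendir my $dh, $source_dir or die "$source_dir: $!";
    open my $out, '>', $out_file or die "$out_file: $!";

    for my $file (sort grep { -f "$source_dir/$_" } readdir $dh) {
        open my $in, '<', "$source_dir/$file" or die "$file: $!";
        my $in_text = 0;
        my @words;
        while (my $line = <$in>) {
            if    ($line =~ m{<TEXT>})  { $in_text = 1; next; }
            elsif ($line =~ m{</TEXT>}) { print {$out} "@words\n"; @words = (); $in_text = 0; next; }
            next unless $in_text;
            $line =~ s/<[^>]+>//g;      # drop <P>, </P> and any other tags
            $line =~ s/%//g;            # remove '%' as described above
            push @words, split ' ', $line;
        }
        close $in;
    }
    close $out;
    closedir $dh;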

4. buildXapian.pl:

Parses the gigaword English files and builds the Xapian index database (3 steps) used by bigramX1.pl.

usage: perl buildXapian.pl xapian_index gigaword/

5. scanBigrams.pl:

Repeatedly scans the duplicate bigrams file (the output file of countX.pl) and writes out the unique bigrams and their frequencies. It uses a Perl hash. For a file of 239,011,894 duplicate bigrams, it took about 7 hours to find and sort the unique bigrams; roughly 3-4 hours of that were spent finding the bigrams, and the rest went to unnecessary loops. For a cleaned bigram file, it does not need all 256 loops. I recommend this method when the input duplicate bigram file is huge.

usage: perl scanBigrams.pl outputFile inputBigramFile (the output file of countX.pl)
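
One way to picture the repeated scan (the bucketing by leading byte below is an assumption made only for illustration, chosen because it gives at most 256 passes; the real script's partitioning may differ):

    #!/usr/bin/perl
    # Sketch: make up to 256 passes over the duplicate-bigram file; each pass
    # counts only the bigrams whose first byte falls in the current bucket,
    # so the hash never has to hold the whole vocabulary at once.
    use strict;
    use warnings;

    my ($out_file, $in_file) = @ARGV;
    open my $out, '>', $out_file or die "$out_file: $!";

    for my $bucket (0 .. 255) {          # one pass per possible leading byte
        my %freq;
        open my $in, '<', $in_file or die "$in_file: $!";
        while (my $bigram = <$in>) {
            chomp $bigram;
            next unless length($bigram) && ord($bigram) == $bucket;
            $freq{$bigram}++;
        }
        close $in;
        # bigrams in one bucket never occur in another, so they can be
        # written out (sorted) as soon as the pass finishes
        print {$out} "$_ $freq{$_}\n" for sort keys %freq;
    }
    close $out;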

6. compare.pl:

Compares the results of count.pl and scanBigrams.pl. scanBigrams.pl writes its bigrams in alphabetical order, while count.pl writes its bigrams in no particular order. The comparison is straightforward: the results of count.pl are read into a hash, then the results of scanBigrams.pl are read and the bigram frequencies are compared. If the frequencies of a bigram match, that bigram is deleted from the hash. If the hash is empty at the end, the results of count.pl and scanBigrams.pl are the same.
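
A minimal sketch of that comparison, assuming each result line is a bigram followed by a single frequency (the real files may carry extra columns, so compare.pl's parsing can differ):

    #!/usr/bin/perl
    # Sketch: load count.pl's bigrams into a hash, then walk scanBigrams.pl's
    # output and delete every bigram whose frequency matches. An empty hash
    # at the end means the two results agree.
    use strict;
    use warnings;

    my ($count_file, $scan_file) = @ARGV;
    my %count;

    open my $fh, '<', $count_file or die "$count_file: $!";
    while (<$fh>) {
        my ($bigram, $freq) = /^(.*\S)\s+(\d+)\s*$/ or next;
        $count{$bigram} = $freq;
    }
    close $fh;

    open $fh, '<', $scan_file or die "$scan_file: $!";
    while (<$fh>) {
        my ($bigram, $freq) = /^(.*\S)\s+(\d+)\s*$/ or next;
        delete $count{$bigram}
            if exists $count{$bigram} && $count{$bigram} == $freq;
    }
    close $fh;

    if (%count) { print "results differ\n" }
    else        { print "results are the same\n" }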

7. bigramX1.pl:

Takes one Xapian database as input and calculates the bigram frequencies for the whole corpus. It finds ALL combinations of two terms within the window (3 steps). It only works for small documents and is extremely slow.

usage: perl bigramX1.pl outputFile inputXapianDatabase

8. bigramX2.pl:

Takes the duplicate bigram file as input, builds SEVERAL Xapian databases (one document per database), and then calculates the bigram frequencies for the whole corpus. It can handle big documents: it took about 20 minutes to build the Xapian index databases, and it may need about 40 hours to find the unique bigrams (it was still running at the time of writing).

usage: perl bigramX2.pl outputFile inputBigramFile (the output file of countX.pl)

9. bigramX3.pl:

Takes the duplicate bigram file as input, builds SEVERAL documents within ONE Xapian database, and then calculates the bigram frequencies for the whole corpus. It only works for small documents (though it can take a bigger input file than bigramX4.pl).

usage: perl bigramX3.pl outputFile inputBigramFile (the output file of countX.pl)

10. bigramX4.pl:

Takes the duplicate bigram file as input, builds ONE document within ONE Xapian database, and then calculates the bigram frequencies for the whole corpus. It only works for small documents.

usage: perl bigramX4.pl outputFile inputBigramFile (the output file of countX.pl)

11. comboX:

An empty file for countX.pl

12. README.txt:

This file.

13. ./gigaword:

contains gigaword English XML files.

./afp_eng/afp_eng_199405: contains 240 documents of afp_eng on 05/12/1994

./xin_eng/xin_eng_199501: contains 449 documents of xin_eng from 01/01/1995 to 01/03/1995

./cna_eng/cna_eng_199709: contains 522 documents of cna_eng in 09/1997


Copyright

Copyright (C) 2009, Ying Liu.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.