Package pygar.demo0P
Class CompareFiles
java.lang.Object
pygar.demo0P.CompareFiles
public class CompareFiles
extends java.lang.Object
This class contains methods to locate similar documents contained in two files.
The similarity is based on word counts that are contained in the XML document.
The class does not expect the actual document. In Demonstration 0 the body of the
document is never sent to the BAN; instead, the team members send only the
accession number and word counts for each document.
-
Field Summary
Fields
Modifier and Type    Field    Description
java.util.List<java.lang.String>
accessionList1
java.util.List<java.lang.String>
accessionList2
java.util.Set<java.lang.String>
accessionSet1
java.util.Set<java.lang.String>
accessionSet2
java.util.Map<java.lang.String,java.util.Map<java.lang.String,pygar.demo0P.CompareFiles.DocStruct>>
documents
static pygar.demo0P.CompareFiles.DrawFrame
frame1
pygar.demo0P.CompareFiles.ImageComponent
imageComponent
java.awt.image.MemoryImageSource
imageSource
int
ndocs1
int
ndocs2
int[]
pix
int
reportedResults
java.util.List<pygar.demo0P.CompareFiles.ResultStruct>
resultList
java.util.List<java.lang.String>
titleList1
java.util.List<java.lang.String>
titleList2
java.util.Set<java.lang.String>
titleSet1
java.util.Set<java.lang.String>
titleSet2
-
Constructor Summary
Constructors
Constructor    Description
CompareFiles(java.lang.String fileDir)
-
Method Summary
Modifier and Type    Method    Description
void
compareDocuments(java.util.Map<java.lang.String,pygar.demo0P.CompareFiles.DocStruct> doc1, java.util.Map<java.lang.String,pygar.demo0P.CompareFiles.DocStruct> doc2)
Compare documents for similarity based on word counts.
void
computeFreqs()
Calculate the frequency of each word in the aggregate of all documents.
void
computeProfileFreqs()
Compute word frequency profiles for the individual documents in a set of documents.
void
displayResults()
void
ingestFile(java.lang.String teamName, java.lang.String dirName, java.lang.String fileName)
void
initDisplay()
static void
main(java.lang.String[] args)
void
setBAN(BAN thisBAN)
Save a reference to the BAN instance so that the progress bar in the GUI panel can be updated.
void
sumCounts()
Combine the word counts from each document in a series of documents.
void
trimSums(int ndelete)
Reduce the size of the word list by eliminating the most frequent words.
void
writeResultsXML(java.lang.String dir, java.lang.String file, int nresults, java.lang.String entity1, java.lang.String entity2)
-
Field Details
-
documents
public java.util.Map<java.lang.String,java.util.Map<java.lang.String,pygar.demo0P.CompareFiles.DocStruct>> documents
-
resultList
public java.util.List<pygar.demo0P.CompareFiles.ResultStruct> resultList
-
reportedResults
public int reportedResults
-
titleList1
public java.util.List<java.lang.String> titleList1
-
titleSet1
public java.util.Set<java.lang.String> titleSet1
-
titleList2
public java.util.List<java.lang.String> titleList2
-
titleSet2
public java.util.Set<java.lang.String> titleSet2
-
accessionSet1
public java.util.Set<java.lang.String> accessionSet1
-
accessionList1
public java.util.List<java.lang.String> accessionList1
-
accessionSet2
public java.util.Set<java.lang.String> accessionSet2
-
accessionList2
public java.util.List<java.lang.String> accessionList2
-
ndocs1
public int ndocs1
-
ndocs2
public int ndocs2
-
imageSource
public java.awt.image.MemoryImageSource imageSource
-
imageComponent
public pygar.demo0P.CompareFiles.ImageComponent imageComponent
-
pix
public int[] pix
-
frame1
public static pygar.demo0P.CompareFiles.DrawFrame frame1
-
-
Constructor Details
-
CompareFiles
public CompareFiles(java.lang.String fileDir)
-
-
Method Details
-
setBAN
Save a reference to the BAN instance so that the progress bar in the GUI panel can be updated.
- Parameters:
thisBAN - the BAN instance to keep a reference to
-
ingestFile
public void ingestFile(java.lang.String teamName, java.lang.String dirName, java.lang.String fileName) throws java.io.FileNotFoundException, DocumentError
- Throws:
java.io.FileNotFoundException
DocumentError
-
sumCounts
public void sumCounts()
Combine the word counts from each document in a series of documents.
-
trimSums
public void trimSums(int ndelete)
Reduce the size of the word list by eliminating the most frequent words.
- Parameters:
ndelete - number of words to delete
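The summing and trimming steps can be illustrated with a minimal sketch. It is not the CompareFiles implementation: the class name WordCountSketch, the use of plain Map<String, Integer> word counts in place of DocStruct, and the alphabetical tie-break are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; not part of pygar.demo0P.
public class WordCountSketch {

    /** Add the per-document word counts into one aggregate map. */
    public static Map<String, Integer> sumCounts(List<Map<String, Integer>> docs) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                // merge() inserts the count or adds it to the existing total.
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    /** Remove the ndelete words with the highest aggregate counts. */
    public static void trimSums(Map<String, Integer> totals, int ndelete) {
        List<String> words = new ArrayList<>(totals.keySet());
        // Highest count first; ties broken alphabetically (an assumption)
        // so that the result is deterministic.
        words.sort((a, b) -> {
            int byCount = Integer.compare(totals.get(b), totals.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        int n = Math.max(0, Math.min(ndelete, words.size()));
        for (String w : words.subList(0, n)) {
            totals.remove(w);
        }
    }
}
```

Trimming the most frequent words first is a cheap stand-in for a stop-word list: words such as "the" dominate every document and carry little information for matching.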
-
computeFreqs
public void computeFreqs()
Calculate the frequency of each word in the aggregate of all documents.
-
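As a rough illustration of computeFreqs, the sketch below treats "frequency" as a relative frequency: each word's aggregate count divided by the sum of all aggregate counts. That definition, the class name AggregateFreqSketch, and the Map-based signature are assumptions; the documentation does not state how CompareFiles stores or normalizes these values.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; not part of pygar.demo0P.
public class AggregateFreqSketch {

    /** Convert aggregate word counts into relative frequencies. */
    public static Map<String, Double> computeFreqs(Map<String, Integer> totals) {
        long all = 0;
        for (int count : totals.values()) {
            all += count;
        }
        Map<String, Double> freqs = new HashMap<>();
        if (all == 0) {
            return freqs; // no words, so there are no frequencies to report
        }
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            freqs.put(e.getKey(), e.getValue() / (double) all);
        }
        return freqs;
    }
}
```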
computeProfileFreqs
public void computeProfileFreqs()
Compute word frequency profiles for the individual documents in a set of documents. Frequencies are computed for every word on the global word list, even if the word does not appear in a particular document. Note also that each frequency is relative to the words in the document that are also on the global word list. Thus, the very common words do not count toward the total, nor is a frequency calculated for them.
Steps:
1. Total all word counts in the individual document, excluding the very common words.
2. For each word in the document's local word list that is also on the global list, compute the frequency as the word count divided by the total from step 1.
3. For each word on the global list, check whether the profile contains a frequency; if none is found, set the frequency for that word to zero.
-
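The three steps described for computeProfileFreqs can be sketched for a single document as follows. This is an illustration under stated assumptions, not the class's actual code: ProfileFreqSketch and profile are hypothetical names, the document is represented as a plain Map of word counts, and the global word list is passed in as a Set that already excludes the very common words.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class; not part of pygar.demo0P.
public class ProfileFreqSketch {

    /** Build a frequency profile for one document over the global word list. */
    public static Map<String, Double> profile(Map<String, Integer> docCounts,
                                              Set<String> globalWords) {
        // Step 1: total the document's counts, skipping words that are not
        // on the global list (the very common words were removed from it).
        long total = 0;
        for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
            if (globalWords.contains(e.getKey())) {
                total += e.getValue();
            }
        }

        // Step 2: frequency = count / total, for local words on the global list.
        Map<String, Double> profile = new HashMap<>();
        if (total > 0) {
            for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
                if (globalWords.contains(e.getKey())) {
                    profile.put(e.getKey(), e.getValue() / (double) total);
                }
            }
        }

        // Step 3: every global word gets an entry; missing ones are set to zero.
        for (String word : globalWords) {
            profile.putIfAbsent(word, 0.0);
        }
        return profile;
    }
}
```

Because every profile ends up with the same key set (the global word list), two profiles can be compared entry by entry without checking for missing words.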
compareDocuments
public void compareDocuments(java.util.Map<java.lang.String,pygar.demo0P.CompareFiles.DocStruct> doc1, java.util.Map<java.lang.String,pygar.demo0P.CompareFiles.DocStruct> doc2)
Compare documents for similarity based on word counts. This is not an ideal search method, but it is sufficient to illustrate document matching. The comparison is actually between two files, each containing a series of short documents labeled by accession and title.
- Parameters:
doc1 -
doc2 -
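The documentation does not say which similarity measure compareDocuments applies to the word counts. Purely as an illustration, the sketch below scores two frequency profiles with cosine similarity, a common choice for this kind of comparison. The measure itself, the class name ProfileSimilaritySketch, and the Map<String, Double> profile type are all assumptions and may differ from what CompareFiles actually does.

```java
import java.util.Map;

// Hypothetical helper class; not part of pygar.demo0P.
public class ProfileSimilaritySketch {

    /**
     * Cosine similarity of two word-frequency profiles.
     * Returns 0.0 when either profile has no non-zero entries.
     */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            double x = e.getValue();
            normA += x * x;
            Double y = b.get(e.getKey());
            if (y != null) {
                dot += x * y; // only words present in both profiles contribute
            }
        }
        for (double y : b.values()) {
            normB += y * y;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }
}
```

Comparing every document in one file against every document in the other with such a score yields an ndocs1 by ndocs2 grid of values, from which the best matches could be reported.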
-
writeResultsXML
public void writeResultsXML(java.lang.String dir, java.lang.String file, int nresults, java.lang.String entity1, java.lang.String entity2) throws java.io.IOException, javax.xml.stream.XMLStreamException
- Throws:
java.io.IOException
javax.xml.stream.XMLStreamException
-
initDisplay
public void initDisplay() throws java.lang.InterruptedException, java.lang.reflect.InvocationTargetException
- Throws:
java.lang.InterruptedException
java.lang.reflect.InvocationTargetException
-
displayResults
public void displayResults() -
main
public static void main(java.lang.String[] args) throws DocumentError, java.lang.InterruptedException, java.lang.reflect.InvocationTargetException, java.io.IOException, javax.xml.stream.XMLStreamException
- Parameters:
args -
- Throws:
DocumentError
java.lang.reflect.InvocationTargetException
java.lang.InterruptedException
javax.xml.stream.XMLStreamException
java.io.IOException
-