Package pygar.demo0P

Class CompareFiles

java.lang.Object
pygar.demo0P.CompareFiles

public class CompareFiles
extends java.lang.Object
This class contains methods to locate similar documents contained in two files. Similarity is based on the word counts contained in the XML document. The class does not expect the document body itself: in Demonstration 0 the body is never sent to the BAN. Instead, the team members send the accession number and word counts for each document.
  • Field Summary

    Fields 
    Modifier and Type Field Description
    java.util.List<java.lang.String> accessionList1  
    java.util.List<java.lang.String> accessionList2  
    java.util.Set<java.lang.String> accessionSet1  
    java.util.Set<java.lang.String> accessionSet2  
    java.util.Map<java.lang.String,​java.util.Map<java.lang.String,​pygar.demo0P.CompareFiles.DocStruct>> documents  
    static pygar.demo0P.CompareFiles.DrawFrame frame1  
    pygar.demo0P.CompareFiles.ImageComponent imageComponent  
    java.awt.image.MemoryImageSource imageSource  
    int ndocs1  
    int ndocs2  
    int[] pix  
    int reportedResults  
    java.util.List<pygar.demo0P.CompareFiles.ResultStruct> resultList  
    java.util.List<java.lang.String> titleList1  
    java.util.List<java.lang.String> titleList2  
    java.util.Set<java.lang.String> titleSet1  
    java.util.Set<java.lang.String> titleSet2  
  • Constructor Summary

    Constructors 
    Constructor Description
    CompareFiles​(java.lang.String fileDir)  
  • Method Summary

    Modifier and Type Method Description
    void compareDocuments​(java.util.Map<java.lang.String,​pygar.demo0P.CompareFiles.DocStruct> doc1, java.util.Map<java.lang.String,​pygar.demo0P.CompareFiles.DocStruct> doc2)
    Compare documents for similarity based on word counts.
    void computeFreqs()
    Calculate the frequency of the words in the aggregate of all documents.
    void computeProfileFreqs()
    Compute word frequency profiles for individual documents in a set of documents.
    void displayResults()  
    void ingestFile​(java.lang.String teamName, java.lang.String dirName, java.lang.String fileName)  
    void initDisplay()  
    static void main​(java.lang.String[] args)  
    void setBAN​(BAN thisBAN)
    Save a reference to the BAN instance so that the progress bar in the GUI panel can be updated.
    void sumCounts()
    Combine the word counts from each of a series of documents.
    void trimSums​(int ndelete)
    Reduce the size of the word list by eliminating the most frequent words.
    void writeResultsXML​(java.lang.String dir, java.lang.String file, int nresults, java.lang.String entity1, java.lang.String entity2)  

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • documents

      public java.util.Map<java.lang.String,​java.util.Map<java.lang.String,​pygar.demo0P.CompareFiles.DocStruct>> documents
    • resultList

      public java.util.List<pygar.demo0P.CompareFiles.ResultStruct> resultList
    • reportedResults

      public int reportedResults
    • titleList1

      public java.util.List<java.lang.String> titleList1
    • titleSet1

      public java.util.Set<java.lang.String> titleSet1
    • titleList2

      public java.util.List<java.lang.String> titleList2
    • titleSet2

      public java.util.Set<java.lang.String> titleSet2
    • accessionSet1

      public java.util.Set<java.lang.String> accessionSet1
    • accessionList1

      public java.util.List<java.lang.String> accessionList1
    • accessionSet2

      public java.util.Set<java.lang.String> accessionSet2
    • accessionList2

      public java.util.List<java.lang.String> accessionList2
    • ndocs1

      public int ndocs1
    • ndocs2

      public int ndocs2
    • imageSource

      public java.awt.image.MemoryImageSource imageSource
    • imageComponent

      public pygar.demo0P.CompareFiles.ImageComponent imageComponent
    • pix

      public int[] pix
    • frame1

      public static pygar.demo0P.CompareFiles.DrawFrame frame1
  • Constructor Details

    • CompareFiles

      public CompareFiles​(java.lang.String fileDir)
  • Method Details

    • setBAN

      public void setBAN​(BAN thisBAN)
      Save a reference to the BAN instance so that the progress bar in the GUI panel can be updated.
      Parameters:
      thisBAN - the BAN instance whose GUI progress bar is to be updated
    • ingestFile

      public void ingestFile​(java.lang.String teamName, java.lang.String dirName, java.lang.String fileName) throws java.io.FileNotFoundException, DocumentError
      Throws:
      java.io.FileNotFoundException
      DocumentError
    • sumCounts

      public void sumCounts()
      Combine the word counts from each of a series of documents.
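A minimal sketch of combining per-document word counts into one aggregate table. The internals of DocStruct are not documented here, so the sketch assumes each document is simply a map from word to count. The class name SumCountsSketch and the static method signature are hypothetical and are not the signature of CompareFiles.sumCounts().

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of pygar.demo0P.
final class SumCountsSketch {

    /** Adds up the word counts of every document into one aggregate map. */
    static Map<String, Integer> sumCounts(List<Map<String, Integer>> perDocumentCounts) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> counts : perDocumentCounts) {
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                // merge() inserts the count or adds it to the running total
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Integer> docA = Map.of("the", 4, "protein", 2);
        Map<String, Integer> docB = Map.of("the", 3, "kinase", 1);
        System.out.println(sumCounts(List.of(docA, docB)));
    }
}
```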
    • trimSums

      public void trimSums​(int ndelete)
      Reduce the size of the word list by eliminating the most frequent words.
      Parameters:
      ndelete - number of words to delete
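One way to drop the ndelete most frequent words is to sort the aggregate counts in descending order and remove the leading entries. The sketch below assumes the aggregate is a word-to-count map. TrimSumsSketch and its static method are hypothetical names, and ties between equally frequent words are broken arbitrarily here, which may differ from the actual class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of pygar.demo0P.
final class TrimSumsSketch {

    /** Returns a copy of totals without its ndelete most frequent words. */
    static Map<String, Integer> trimSums(Map<String, Integer> totals, int ndelete) {
        List<Map.Entry<String, Integer>> byCount = new ArrayList<>(totals.entrySet());
        // Highest counts first
        byCount.sort(Map.Entry.<String, Integer>comparingByValue().reversed());

        Map<String, Integer> trimmed = new HashMap<>(totals);
        // Guard against negative values and values larger than the word list
        int limit = Math.min(Math.max(ndelete, 0), byCount.size());
        for (int i = 0; i < limit; i++) {
            trimmed.remove(byCount.get(i).getKey());
        }
        return trimmed;
    }

    public static void main(String[] args) {
        Map<String, Integer> totals = Map.of("the", 7, "of", 5, "kinase", 1);
        System.out.println(trimSums(totals, 2));
    }
}
```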
    • computeFreqs

      public void computeFreqs()
      Calculate the frequency of the words in the aggregate of all documents.
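As an illustration, an aggregate frequency can be taken as a word's summed count divided by the sum of all counts. That definition is an assumption, since the description above does not spell out the formula. ComputeFreqsSketch and its method are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of pygar.demo0P.
final class ComputeFreqsSketch {

    /** Converts aggregate counts to relative frequencies (assumed: count / total). */
    static Map<String, Double> computeFreqs(Map<String, Integer> totals) {
        long total = 0;
        for (int count : totals.values()) {
            total += count;
        }
        Map<String, Double> freqs = new HashMap<>();
        if (total == 0) {
            return freqs; // no words, so no frequencies
        }
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            freqs.put(e.getKey(), e.getValue() / (double) total);
        }
        return freqs;
    }

    public static void main(String[] args) {
        System.out.println(computeFreqs(Map.of("kinase", 1, "protein", 3)));
    }
}
```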
    • computeProfileFreqs

      public void computeProfileFreqs()
      Compute word frequency profiles for individual documents in a set of documents. Frequencies are computed for all the words on the global word list, even if they do not appear in a particular document. Note also that the frequency is relative to all words in the document that are also on the global word list. Thus, common words do not count in the total, nor do we calculate a frequency for them.
      Steps:
      1. Total all words across the individual document, except for the very common words.
      2. For all words in the local list of words that are also on the global list, compute the frequency as the word count divided by the number of words in the document (the total from step 1).
      3. For all words on the global list, check whether the profile contains a frequency. If none is found, set the frequency for that word to zero.
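The three steps above can be sketched as follows for a single document. The sketch assumes the document is a word-to-count map and the global word list is a set from which the very common words have already been removed. ProfileFreqsSketch and profile() are hypothetical names, not members of CompareFiles.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of pygar.demo0P.
final class ProfileFreqsSketch {

    /** Builds the frequency profile of one document against the global word list. */
    static Map<String, Double> profile(Map<String, Integer> docCounts, Set<String> globalWords) {
        // Step 1: total the document's words, skipping words not on the global list
        long total = 0;
        for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
            if (globalWords.contains(e.getKey())) {
                total += e.getValue();
            }
        }

        // Step 2: frequency = count / total for local words that are on the global list
        Map<String, Double> profile = new HashMap<>();
        if (total > 0) {
            for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
                if (globalWords.contains(e.getKey())) {
                    profile.put(e.getKey(), e.getValue() / (double) total);
                }
            }
        }

        // Step 3: global words missing from the profile get a frequency of zero
        for (String word : globalWords) {
            profile.putIfAbsent(word, 0.0);
        }
        return profile;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = Map.of("the", 10, "kinase", 3, "protein", 1);
        Set<String> global = Set.of("kinase", "protein", "ligand");
        System.out.println(profile(doc, global));
    }
}
```

In this example "the" is not on the global list, so it contributes neither to the total nor to the profile, while "ligand" appears in the profile with a frequency of zero.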
    • compareDocuments

      public void compareDocuments​(java.util.Map<java.lang.String,​pygar.demo0P.CompareFiles.DocStruct> doc1, java.util.Map<java.lang.String,​pygar.demo0P.CompareFiles.DocStruct> doc2)
      Compare documents for similarity based on word counts. This is not an ideal search method, but it is sufficient to illustrate document matching. The comparison is actually between two files, each containing a series of short documents labeled by accession number and title.
      Parameters:
      doc1 - the documents from the first file
      doc2 - the documents from the second file
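The description does not say which similarity measure is applied to the word counts. Purely as an illustration, the sketch below scores two frequency profiles with cosine similarity. The choice of metric is an assumption, and CosineSketch and cosine() are hypothetical names that do not appear in CompareFiles.

```java
import java.util.Map;

// Hypothetical helper, not part of pygar.demo0P. Cosine similarity is an assumed metric.
final class CosineSketch {

    /** Cosine similarity of two word-frequency profiles, or 0.0 if either is all zeros. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            double x = e.getValue();
            normA += x * x;
            Double y = b.get(e.getKey());
            if (y != null) {
                dot += x * y; // only words present in both profiles contribute
            }
        }
        for (double y : b.values()) {
            normB += y * y;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Double> p1 = Map.of("kinase", 0.75, "protein", 0.25);
        Map<String, Double> p2 = Map.of("kinase", 0.5, "ligand", 0.5);
        System.out.println(cosine(p1, p2));
    }
}
```

Scoring every document of the first file against every document of the second in this way would yield a ranked list of candidate matches.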
    • writeResultsXML

      public void writeResultsXML​(java.lang.String dir, java.lang.String file, int nresults, java.lang.String entity1, java.lang.String entity2) throws java.io.IOException, javax.xml.stream.XMLStreamException
      Throws:
      java.io.IOException
      javax.xml.stream.XMLStreamException
    • initDisplay

      public void initDisplay() throws java.lang.InterruptedException, java.lang.reflect.InvocationTargetException
      Throws:
      java.lang.InterruptedException
      java.lang.reflect.InvocationTargetException
    • displayResults

      public void displayResults()
    • main

      public static void main​(java.lang.String[] args) throws DocumentError, java.lang.InterruptedException, java.lang.reflect.InvocationTargetException, java.io.IOException, javax.xml.stream.XMLStreamException
      Parameters:
      args -
      Throws:
      DocumentError
      java.lang.reflect.InvocationTargetException
      java.lang.InterruptedException
      javax.xml.stream.XMLStreamException
      java.io.IOException