Mining Historical Archives for Near-Duplicate Figures
Thanawin Rakthanmanon, Qiang Zhu, and Eamonn J. Keogh

Figure 1. Two plates from 19th-century texts on diatoms: plate 6 of [15] and plate 5 of [20]. middle) A zoom-in of the same species, Biddulphia alternans, appearing in both texts.

Biddulphia alternans (J.W. Bailey) Van Heurck
Synonym(s): Triceratium alternans J.W. Bailey
Image source: digitised drawing
Literature reference: J. Ralfs in Pritchard: A History of Infusoria (1861), plate 6, fig. 21a
View type: Valve view
Scale: Image height equivalent 59 µm; image width equivalent 76 µm

Biddulphia alternans (J.W. Bailey) Van Heurck
Synonym(s): Triceratium alternans J.W. Bailey
Image source: digitised drawing
Literature reference: W. Smith: British Diatomaceae Vol. 1 (1853), plate 5, fig. 45
View type: Valve view
Scale: Image height equivalent 53 µm; image width equivalent 57 µm

Figure 2. left) A figure from page 7 of [6], a 1915 text on peerage. The original text is monochrome. right) A figure from page 109 of [3], an 1858 text on honors and decorations.

[3] Burke, J. B. 1858. Book of Orders of Knighthood and Decorations of Honour of all Nations, London: Hurst and Blackett.
[6] Dod, C. R. and Dod, R. P. 1915. Dod's Peerage, Baronetage and Knightage of Great Britain and Ireland for 1915, London: Simpkin, Marshall, Hamilton, Kent, Ltd.

Figure 3. Examples of texts with holes.

Figure 4. The distance measure we use is offset-invariant, so the distance between any pair of windows (left, center, or right above) is exactly zero. This simple fact can be exploited to greatly reduce the search space of motif discovery. Since a pattern from another book that matches one of the above with distance X must match all of them with distance X, we only need to include any one of the above in our search.

Figure 5. An illustration of our notation. Here the document D consists of two pages, separated by null values. Intuitively we expect the T shape in window Wa to match the shape shown in Wb. However, note that the trivially matching pair Wc and Wd (and likewise the pair We and Wf) is actually more similar, and must be excluded to prevent pathological results.
[Figure labels: document D plotted with values 1, 0, -1; windows Wa, Wb, Wc, Wd, We, Wf; index labels W20,3 and W3,2.]

Figure 6. An illustration of a pathological solution to finding the top two motif pairs between two century-old texts. top) The desirable solution finds the crescent and the label (rotated E). bottom) A redundant and undesirable solution that we must explicitly exclude: finding one pattern (the label) twice.
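The two exclusions described in Figures 5 and 6 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the (start, width) window encoding, the function names overlaps and top_k_motifs, and the candidate distances are all assumptions made for the example. It treats a pair of overlapping windows as a trivial match, and treats a pair that overlaps an already reported motif as redundant.

```python
from typing import List, Tuple

# Hypothetical encoding: a window is (start, width) in the column
# coordinates of the concatenated document D.
Window = Tuple[int, int]
MotifPair = Tuple[float, Window, Window]


def overlaps(a: Window, b: Window) -> bool:
    """True if the two windows share at least one column of D."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


def top_k_motifs(candidates: List[MotifPair], k: int) -> List[MotifPair]:
    """Greedily keep the k closest pairs that are neither trivial nor redundant."""
    chosen: List[MotifPair] = []
    if k <= 0:
        return chosen
    for dist, wa, wb in sorted(candidates):
        if overlaps(wa, wb):
            continue  # trivial match: a window paired with a shifted copy of itself
        used = [w for _, pa, pb in chosen for w in (pa, pb)]
        if any(overlaps(w, u) for w in (wa, wb) for u in used):
            continue  # redundant: re-finds a pattern that is already reported
        chosen.append((dist, wa, wb))
        if len(chosen) == k:
            break
    return chosen


# Made-up candidate pairs (distance, window, window) for illustration only.
candidates = [
    (0.5, (10, 5), (12, 5)),  # overlapping windows, like Wc and Wd in Figure 5
    (2.0, (3, 5), (40, 5)),
    (2.1, (4, 5), (41, 5)),   # the same pattern as the pair above, shifted by one
    (3.0, (20, 5), (60, 5)),
]
best = top_k_motifs(candidates, k=2)
```

With these made-up candidates, the closest pair is rejected as trivial and the third as redundant, so the two pairs kept are the ones at distances 2.0 and 3.0.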

Figure 7. A) Two figures from table 16 of a 1907 text on Native American rock art [13] (one image recolored red for clarity). B) No matter how we shift these two figures, no more than 16% of their pixels overlap. C) Downsampled versions of the figures share 87.2% of their pixels (D).

Figure 8. A) If we randomly choose some locations (masks) on the underlying bitmap grid on which the two figures (B) shown in Figure 7 lie, and then remove those pixels from the figures, then the distance between the edited figures (C) can only stay the same or decrease. Several random attempts at removing pixels from the two figures eventually produced two identical edited figures (D).
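The masking property stated in Figure 8 can be checked with a small Python sketch. It is a simplified illustration under stated assumptions, not the authors' code: bitmaps are flattened tuples of 0/1 values, the distance is taken to be the Hamming distance, and the names hamming, apply_mask, random_mask, and first_collision, as well as the two toy bitmaps, are hypothetical. The ratio and iterations arguments loosely correspond to the masking ratio and number of iterations that appear as parameters in Figures 19 and 20.

```python
import random
from typing import Optional, Set, Tuple

Bitmap = Tuple[int, ...]  # flattened binary image, 1 = black pixel


def hamming(a: Bitmap, b: Bitmap) -> int:
    """Number of positions at which the two bitmaps differ."""
    return sum(pa != pb for pa, pb in zip(a, b))


def apply_mask(bits: Bitmap, mask: Set[int]) -> Bitmap:
    """Remove (zero out) the pixels at the masked positions."""
    return tuple(0 if i in mask else b for i, b in enumerate(bits))


def random_mask(n_cells: int, ratio: float, rng: random.Random) -> Set[int]:
    """Pick a random subset of grid positions covering `ratio` of the cells."""
    return set(rng.sample(range(n_cells), int(ratio * n_cells)))


def first_collision(a: Bitmap, b: Bitmap, ratio: float,
                    iterations: int, seed: int = 0) -> Optional[int]:
    """Return the first attempt at which the masked bitmaps are identical."""
    rng = random.Random(seed)
    for attempt in range(1, iterations + 1):
        mask = random_mask(len(a), ratio, rng)
        if apply_mask(a, mask) == apply_mask(b, mask):
            return attempt
    return None  # no identical pair within the allowed attempts


# Two toy 4x4 figures (flattened) that differ at positions 3 and 13.
fig_a: Bitmap = (1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0)
fig_b: Bitmap = (1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0)
hit = first_collision(fig_a, fig_b, ratio=0.5, iterations=20)
```

Because masking zeroes the same positions in both figures, every position that differs after masking also differed before it, so the masked distance can never exceed the original one. Similar figures become identical as soon as a mask happens to cover all of their differing pixels, which is why repeated random attempts can make them collide.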

Figure 9. The summation of the number of black pixels in windows. Only windows corresponding to peaks above the threshold (the red line) need to be tested. The arrows show the center positions of six potential windows.

Figure 10. Samples showing the interclass variability in the hand-drawn datasets. left) Samples from the music datasets. right) Samples from the architectural dataset.

Figure 11. left) Two typical pages from Californian petroglyphs [21]. right) Two typical pages from [13]. Note that the minor artifacts are from the original Google scanning.

[13] Koch-Grünberg, T. 1907. Südamerikanische Felszeichnungen (South American petroglyphs), Berlin: E. Wasmuth A-G.
[21] Smith, G. A. and Turner, W. G. 1975. Indian Rock Art of Southern California with Selected Petroglyph Catalog, San Bernardino County Museum Association.

Figure 12. Six random motif pairs from the top fifty pairs created by joining the two texts [13] and [21]. Note that these results suggest that our algorithm is robust to line thickness, solid vs. hollow shapes, and various other distortions.

Figure 13. The top two inter-book motifs discovered when linking a 1921 text, British Heraldry [4] (left), with a 1909 text, English Heraldic Book-Stamps, Figured and Described [5] (center), and (right).

[4] Davenport, C. 1912. British Heraldry, Methuen.
[5] Davenport, C. 1909. English Heraldic Book-Stamps, Figured and Described, London: Archibald Constable, Ltd.

Figure 14. A zoom-in of the motifs discovered in Figure 13.

Figure 15. left) The 14-segment template used to create characters. We can turn each segment on or off independently to generate a vast alphabet. middle) An example of a page generated by this process. right) A page of the book after adding polynomial distortion (top half), and Gaussian noise with mean 0 and variance 0.10 (bottom half).
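The generation process summarized in Figure 15 can be approximated with a short Python sketch. The details here are assumptions rather than the authors' procedure: a character is represented only by its 14 on/off flags (the segment geometry is not given in the caption), noise is added per pixel to a 0/1 bitmap and the result is re-binarized at a hypothetical threshold of 0.5, and the names random_character and add_gaussian_noise are invented for the example. The polynomial distortion is not modeled.

```python
import math
import random
from typing import List, Tuple

N_SEGMENTS = 14
# Each segment is independently on or off, so there are 2**14 on/off
# patterns (this count includes the blank pattern with every segment off).
ALPHABET_SIZE = 2 ** N_SEGMENTS


def random_character(rng: random.Random) -> Tuple[int, ...]:
    """Draw one character as a tuple of 14 on/off segment flags."""
    return tuple(rng.randint(0, 1) for _ in range(N_SEGMENTS))


def add_gaussian_noise(page: List[List[int]], variance: float,
                       rng: random.Random,
                       threshold: float = 0.5) -> List[List[int]]:
    """Add zero-mean Gaussian noise to each pixel, then re-binarize.

    The thresholding step is an assumption made for this sketch.
    """
    sigma = math.sqrt(variance)
    return [
        [1 if px + rng.gauss(0.0, sigma) > threshold else 0 for px in row]
        for row in page
    ]


rng = random.Random(42)
character = random_character(rng)

# A tiny stand-in bitmap for a page (1 = black pixel).
page = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
noisy_page = add_gaussian_noise(page, variance=0.10, rng=rng)
```

The variance of 0.10 matches the noise level quoted in the caption; larger variances flip more pixels, which is the setting explored in Figure 17.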

Figure 16. Time to discover motifs in books of increasing size. Our algorithm can find a motif in 512 pages in 5.5 minutes and 2048 pages in 33 minutes. (inset) As a sanity check we confirmed that the discovered motifs are plausible, as here (noise removed for clarity).
[Plot "Scalability": x-axis Number of Pages (1 to 2048); y-axis Execution Time (sec), 0 to 2000; series: Polynomial distortion, No distortion; inset: Sample Motifs.]

Figure 17. Effect of Gaussian noise. Our algorithm can handle significant amounts of noise. An example of a page containing noise at var=0.10 is shown in Figure 15.right.
[Plot "Effect of Gaussian Noise": x-axis Number of Pages (1 to 256); y-axis Execution Time (sec); series: Var = 0.20, Var = 0.15, Var = 0.10, Var = 0.05, Var = 0.01, No noise.]

Figure 18. The total execution time of three search algorithms: an exact motif search, an exact motif search on just the potential windows, and our algorithm ApproxMotif.
[Plot: x-axis Number of Pages (1 to 512); y-axis Execution Time (sec), 0 to 3.0 x 10^4; series: Exact search (all windows), Exact search (potential windows), ApproxMotif.]

We compared the running times of:
1. Exact motif search over the entire document, applying the best known motif discovery technique [27]
2. Exact motif search over just the potential windows
3. Our proposed algorithm, ApproxMotif

[27] Mueen, A., Keogh, E. J., and Shamlo, N. B. 2009. Finding Time Series Motifs in Disk-Resident Data. ICDM, 367-376.

Figure 19. The effect of parameters on our algorithm. We test on artificial books with polynomial distortion and each result is averaged over ten runs. The bold/red line represents the parameters learned from just the first two pages.
[Plot: four panels (A to D) titled Masking Ratio, Number of Iterations, Hash Downsampling, and Downsampling; x-axis Number of Pages (1 to 512); y-axis Execution Time (sec).]

Figure 20. The average distance from top-20 motifs from our algorithm and the exact search algorithm. The bold/red line shows the default parameters. This shows that the quality of motifs is not sensitive to different parameter settings and is very close to the result from the exact search algorithm.
[Plot: three panels (A to C); x-axis Number of pages (2 to 512); y-axis Average Distance (0 to 30). Masking Ratio series: Mask 60%, Mask 50%, Mask 40%, Mask 30%, Mask 20%, Exact search. Number of Iterations series: Iteration=5, Iteration=9, Iteration=10, Iteration=11, Iteration=20, Exact search. Hash Downsampling series: HDS=2 (4:1), HDS=3 (9:1), Exact search.]
