# Measures of Coincidence - Columbia University

Vasileios Hatzivassiloglou, University of Texas at Dallas

## A study of different measures

- Smadja, McKeown, and Hatzivassiloglou (1996): "Translating Collocations for Bilingual Lexicons: A Statistical Approach"
- Use aligned parallel corpora (Hansards)
- Task: Find the translation for a word group across languages

## Sketch of algorithm

- Start with a set of collocations in French
- Find candidate single-word translations according to the association between the original collocation and the translation
- Measure association between the source collocation and pairs of candidate words
- Expand iteratively to triplets, etc. by recalculating association

## Dice vs. SI

- Dice depends on conditional probabilities only
- SI depends on the marginals: $\log P(X \mid Y) - \log P(X)$
- SI depends on how rare X is
- Limit behavior

## Asymmetry

- Many kinds of asymmetry
  - Between X and Y
  - Between X=1 and X=0
  - 1-1 matches versus 0-0 matches

- Adding 0-0 matches does not change Dice
- Adding 0-0 matches always increases SI

## Effect of asymmetry

- Hypothetical scenario on 100 sentences
- A, B appear together twice, by themselves three times each
- Dice: $2 \times 2 / (5 + 5) = 0.4$
- SI: $\log_2 (0.02 / (0.05 \times 0.05)) = 3$ bits
- MI: 0.0457 bits

## Reversing ones and zeroes

- Now replace every 1 with 0 and vice versa
- The new variables A, B occur together 92 times; each occurs by itself three times
- Dice: $2 \times 92 / (95 + 95) = 0.9684$
- MI: Unchanged (0.0457 bits)
- SI: $\log_2 (0.92 / (0.95 \times 0.95)) = 0.0277$ bits

## Explaining the behavior

- Limit effect as P(X) decreases with P(X|Y) constant
- P(X) eventually dominates SI
- Makes SI (and MI) more sensitive to estimation errors
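The figures in the hypothetical 100-sentence scenario can be checked with a few lines of Python. This is a minimal sketch: the function name `measures` and its argument names are illustrative choices, not taken from the slides, and logarithms are base 2 so that SI and MI come out in bits.

```python
import math

def measures(both, only_x, only_y, neither):
    """Return (dice, si_bits, mi_bits) for a 2x2 table of sentence counts.

    Assumes `both` > 0; SI is undefined when X and Y never co-occur.
    """
    n = both + only_x + only_y + neither
    p_x = (both + only_x) / n
    p_y = (both + only_y) / n
    p_xy = both / n

    # Dice ignores the 0-0 cell entirely.
    dice = 2 * both / ((both + only_x) + (both + only_y))

    # SI looks only at the 1-1 cell, relative to the marginals.
    si = math.log2(p_xy / (p_x * p_y))

    # MI averages the same log ratio over all four cells.
    cells = [
        (both, p_x, p_y),
        (only_x, p_x, 1 - p_y),
        (only_y, 1 - p_x, p_y),
        (neither, 1 - p_x, 1 - p_y),
    ]
    mi = sum(
        (count / n) * math.log2((count / n) / (px * py))
        for count, px, py in cells
        if count > 0
    )
    return dice, si, mi

original = measures(both=2, only_x=3, only_y=3, neither=92)
flipped = measures(both=92, only_x=3, only_y=3, neither=2)
print("original:", [round(v, 4) for v in original])
print("flipped: ", [round(v, 4) for v in flipped])
```

Swapping the `both` and `neither` counts is all it takes to reverse ones and zeroes, which makes it easy to see that MI is symmetric under the swap while Dice and SI are not.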

## Bounds and testing purpose

- No upper bound for SI and MI
- Dice is always between 0 and 1
- Easy to test SI/MI for independence
- Easy to test Dice for correlation

## Empirical comparison

- How to compare without redoing the entire experiment?
- Solution: Use the competing measure in the last round
- Test cases where the correct solution is available
- Provides a lower bound on the competitor's error

## Empirical results

- 45 French collocations
- 2 did not produce any candidate translation
- Dice resulted in 36 correct, 7 incorrect translations
- SI resulted in 26 correct, 17 incorrect translations
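The iterative procedure from the sketch of the algorithm, together with the idea of swapping in a competing measure only in the last round, could be prototyped roughly as follows. This is a hypothetical toy, not the authors' system: the function `translate`, the `threshold` and `max_size` parameters, the rule for picking the final group, and the six-sentence corpus are all assumptions made for illustration.

```python
def dice(xs, ys):
    """Dice coefficient between two sets of sentence indices."""
    if not xs or not ys:
        return 0.0
    return 2 * len(xs & ys) / (len(xs) + len(ys))

def translate(source_hits, target_sentences, threshold=0.6,
              max_size=3, final_measure=dice):
    """Grow a target word group associated with a source collocation.

    source_hits: indices of aligned sentences containing the collocation.
    target_sentences: list of word sets, one per aligned target sentence.
    final_measure: measure used only to rank the surviving groups,
        mimicking "use the competing measure in the last round".
    """
    # Inverted index: target word -> sentences containing it.
    index = {}
    for i, words in enumerate(target_sentences):
        for w in words:
            index.setdefault(w, set()).add(i)

    def hits(group):
        # Sentences in which every word of the group occurs.
        return set.intersection(*(index[w] for w in group))

    # Round 1: single-word candidates.
    singles = [w for w in index if dice(source_hits, index[w]) >= threshold]
    survivors = [frozenset([w]) for w in singles]
    candidates = list(survivors)

    # Later rounds: expand to pairs, triplets, ... and rescore.
    for _ in range(max_size - 1):
        grown = {g | {w} for g in survivors for w in singles if w not in g}
        survivors = [g for g in grown
                     if dice(source_hits, hits(g)) >= threshold]
        if not survivors:
            break
        candidates.extend(survivors)

    if not candidates:
        return None  # no candidate translation produced
    return max(candidates,
               key=lambda g: (final_measure(source_hits, hits(g)), len(g)))

# Toy aligned corpus: the source collocation occurs in sentences 0, 1, 5.
english = [
    {"the", "prime", "minister", "said"},
    {"the", "prime", "minister", "spoke"},
    {"the", "house", "sat"},
    {"prime", "time", "the"},
    {"the", "minister", "of", "finance"},
    {"prime", "minister", "answered"},
]
result = translate({0, 1, 5}, english)
print(sorted(result))
```

Passing a different function as `final_measure` changes only the final ranking, which is why an error rate measured this way can only understate how a competing measure would fare if it were used in every round.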

## Re-examining contingency tables

- Ted Dunning, "Accurate Methods for the Statistics of Surprise and Coincidence," Computational Linguistics, 1993
- Problem: Asymptotic normality assumptions
- How much data is enough?
- Are researchers aware of the need for statistical validity analysis?

## Rarity of words

- Empirical counts on words show that 20-30% of words appear less than once in 50,000 words
- Estimating a binomial as normal: good as long as $np(1-p) > 5$
- Significance is overestimated by 20% for $np = 1$, by a factor of 40 for $np = 0.1$, and by a factor of $10^{20}$ for $np = 0.01$
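One way to get a feel for these overestimates is to compare the exact probability of seeing a rare word at least once with the tail probability that a normal approximation assigns to the same event. The sketch below is an assumption about how such figures can arise, not a reproduction of Dunning's table: it uses the Poisson limit of the binomial (large n, tiny p, so the variance is roughly np), and the helper name `tail_ratio` is made up for this example.

```python
import math

def tail_ratio(np_):
    """Ratio of exact to normal-approximation probability of k >= 1.

    A ratio above 1 means the normal approximation makes the event look
    rarer, and therefore more significant, than it really is.
    """
    # Exact P(k >= 1) in the Poisson limit: 1 - exp(-np).
    exact = -math.expm1(-np_)

    # Normal approximation with mean np and standard deviation sqrt(np).
    z = (1 - np_) / math.sqrt(np_)
    normal = 0.5 * math.erfc(z / math.sqrt(2))

    return exact / normal

for np_ in (1.0, 0.1, 0.01):
    print(f"np = {np_}: ratio = {tail_ratio(np_):.3g}")
```

Under these assumptions the ratio is a little over 1 for np = 1, in the tens for np = 0.1, and on the order of $10^{20}$ for np = 0.01, which is the same pattern the slide describes.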

## Likelihood in parameter spaces

- Parametric model (known except for parameter values)
- Likelihood function $H(\omega; k)$
- Hypothesis represented by a point $\omega_0$
- Likelihood ratio

$$\lambda = \frac{\max_{\omega \in \Omega_0} H(\omega; k)}{\max_{\omega \in \Omega} H(\omega; k)}$$

- Test statistic: $-2 \log \lambda$
- Rapidly approaches the $\chi^2$ distribution for binomial $H$
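For the case of comparing two binomial samples, which is the form given in Dunning's paper, the statistic can be written out explicitly. The symbols $k_i$, $n_i$, and $p_i$ (successes, trials, and estimated rate in sample $i$) do not appear on the slides and are introduced here only for illustration.

```latex
\[
  L(p, k, n) = p^{k} (1 - p)^{n - k}, \qquad
  p_1 = \frac{k_1}{n_1}, \quad
  p_2 = \frac{k_2}{n_2}, \quad
  p = \frac{k_1 + k_2}{n_1 + n_2}
\]
\[
  -2 \log \lambda = 2 \bigl[ \log L(p_1, k_1, n_1) + \log L(p_2, k_2, n_2)
                           - \log L(p, k_1, n_1) - \log L(p, k_2, n_2) \bigr]
\]
```

The numerator of $\lambda$ forces both samples to share the single rate $p$, while the denominator lets each sample keep its own rate, so a large value of $-2 \log \lambda$ signals that a shared rate fits the data poorly.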

## Comparing to chi-square

- Leads to the same formula as Pearson's chi-square statistic when approximating with the normal distribution
- Diverges significantly from Pearson's chi-square statistic for low np
- Closely follows the chi-square distribution

## Experimental results

- 32,000 words of financial text from Switzerland
- Find highly correlated word pairs
- Observe top-ranked entries for log-likelihood and chi-square
- Chi-square leads to huge scores for rare pairs
- 2,682 of 2,693 bigrams violate assumptions
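The tendency of chi-square to produce huge scores for rare pairs can be seen on a single made-up contingency table. The sketch below assumes a hypothetical bigram whose two words each occur exactly once, next to each other, among 32,000 bigram positions; the counts and function names are illustrative and not taken from the experiment.

```python
import math

def pearson_chi2(a, b, c, d):
    """Pearson's chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def log_likelihood_g2(a, b, c, d):
    """Log-likelihood ratio statistic, 2 * sum(O * ln(O / E)), same table."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2 * total

# Hypothetical rare pair: seen together once, never apart.
table = (1, 0, 0, 31999)
chi2 = pearson_chi2(*table)
g2 = log_likelihood_g2(*table)
print(f"chi-square = {chi2:.1f}, log-likelihood = {g2:.2f}")
```

A single co-occurrence gives a chi-square score equal to the corpus size, while the log-likelihood score stays in the low tens, which matches the qualitative behavior described in the experimental results.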
