python - RapidFuzz discount matching of common tokens - Stack Overflow

Using

import re
import pandas as pd
import rapidfuzz

# Normalize each account name, strip the extraneous terms, and sort.
choices = [rapidfuzz.utils.default_process(sentence=x) for x in allcrmaccts['Account Name']]
choices = [re.sub('|'.join(extraneous),'',x) for x in choices]
choices = sorted(choices)
queries = [rapidfuzz.utils.default_process(sentence=x) for x in givenaccts['Account Name']]
queries = [re.sub('|'.join(extraneous),'',x) for x in queries]
queries = sorted(queries)
# Score every query against every choice; rows are queries, columns are choices.
allcrmsearch = rapidfuzz.process.cdist(choices=choices, queries=queries, workers=-1, scorer=rapidfuzz.fuzz.WRatio)
allcrm=pd.DataFrame(allcrmsearch, columns=choices, index=queries)

yields these results

allcrm[allcrm.max(axis=1)==100].idxmax(axis=1)

3b   the fibreglass                                                                 3b spa
3d carbon                                                                  3d cad  i  pvt 
3m                                                                                      3m
5m                                                                                     m m
a p technology                                                   2a s p a    divisione f2a
                                                                         ...              
z laser optoelektronik gmbh                                  2 e mechatronic gmbh   co  kg
zhermack spa                                                                        3b spa
zoltek                                                                                  z 
zsk stickmaschinen gmbh  zsk technical embroidery systems    2 e mechatronic gmbh   co  kg
zund systemtechnik ag                                            3s swiss solar systems ag

allcrm.loc['toray advanced composites'].nlargest(10)

cobra advanced composites                               92.0
advanced animal care of mount pleasant                  85.5
advanced armour engineering  optimized armor            85.5
advanced bioenergy of the carolinas  abc                85.5
advanced composite structures    acs group              85.5
advanced computers and mobiles india private limited    85.5
advanced environmental services carolina air care       85.5
advanced healthcare staffing solutions                  85.5
advanced international multitech co     dizo bike       85.5
advanced logistics for aerospace   ala                  85.5

which gives undesired matches.
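
One possible explanation for the repeated 85.5 is how WRatio combines its sub-scorers. The sketch below reproduces that arithmetic under assumptions about RapidFuzz internals that should be checked against its source: that partial_token_ratio returns 100 as soon as the two strings share any token, and that WRatio scales it by 0.95 and by 0.9 (or 0.6) once the length ratio reaches 1.5. The function name is hypothetical and nothing here calls RapidFuzz itself.

```python
def common_token_wratio(s1, s2):
    """Score a single shared token would contribute to WRatio (assumed weights)."""
    unbase_scale = 0.95  # assumed weight WRatio applies to token-based scorers
    shorter, longer = sorted((len(s1), len(s2)))
    if shorter == 0:
        return None
    len_ratio = longer / shorter
    if len_ratio < 1.5:
        # Assumed: the partial scorers are not used for similar-length strings.
        return None
    # Assumed: extra down-weighting of partial scorers, harsher for very unequal lengths.
    partial_scale = 0.9 if len_ratio <= 8 else 0.6
    return 100 * unbase_scale * partial_scale


# 25 vs 38 characters gives a length ratio of 1.52, so the partial branch applies.
score = common_token_wratio('toray advanced composites',
                            'advanced animal care of mount pleasant')
print(round(score, 1))
```

If those assumptions hold, any sufficiently longer choice that shares the single token "advanced" with the query lands on the same score, regardless of its other words.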

I made a scorer comparison table to see the similarity values for known matches:

t = []
s = 'toray advanced composites'
q = [col for col in allcrm.columns if 'toray' in col]
q.append('aerox advanced polymers')
for x in q:
    a = [
         rapidfuzz.fuzz.ratio(s1=s, s2=x, processor=rapidfuzz.utils.default_process),
         rapidfuzz.fuzz.partial_ratio(s1=s, s2=x, processor=rapidfuzz.utils.default_process),
         rapidfuzz.fuzz.token_ratio(s1=s, s2=x, processor=rapidfuzz.utils.default_process),
         rapidfuzz.fuzz.partial_ratio_alignment(s1=s, s2=x, processor=rapidfuzz.utils.default_process),
         rapidfuzz.fuzz.partial_token_ratio(s1=s, s2=x, processor=rapidfuzz.utils.default_process),
         rapidfuzz.fuzz.WRatio(s1=s, s2=x, processor=rapidfuzz.utils.default_process),
         rapidfuzz.fuzz.QRatio(s1=s, s2=x, processor=rapidfuzz.utils.default_process)]
    t.append(a)

pd.DataFrame(t, 
             columns=['Ratio','Partial Ratio','Token Ratio','Partial Ratio Alignment', 'Partial Token Ratio', 'WRatio','QRatio'],
             index=q)

Is there a way to discount matches on tokens that are common across the choices? I built a histogram of the most frequent tokens; matching a common word such as "advanced" should count for less than matching a rare, near-unique token such as "toray".
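
To make the idea concrete, here is a minimal sketch of one way such a discount could work: weight each token by a smoothed inverse document frequency computed over the choices, then score two names by the weighted overlap of their token sets. This is not a RapidFuzz feature. The helper names and the sample list are hypothetical, and the score only counts exact token matches, so it would likely need to be blended with a fuzzy scorer to tolerate typos.

```python
import math
from collections import Counter


def token_weights(choices):
    """Build an IDF-style weight function from the list of choices."""
    # Document frequency: in how many choices each token appears at least once.
    doc_freq = Counter(tok for name in choices for tok in set(name.split()))
    n = len(choices)

    def weight(token):
        # Smoothed IDF: frequent tokens get a weight near 1, rare ones more.
        return math.log((1 + n) / (1 + doc_freq[token])) + 1.0

    return weight


def weighted_token_score(s1, s2, weight):
    """Weighted Jaccard overlap of the two token sets, scaled to 0-100."""
    t1, t2 = set(s1.split()), set(s2.split())
    union = sum(weight(t) for t in t1 | t2)
    if union == 0:
        return 0.0
    shared = sum(weight(t) for t in t1 & t2)
    return 100.0 * shared / union


# Hypothetical sample of already-normalized account names.
sample = ['toray advanced composites', 'toray industries',
          'cobra advanced composites', 'advanced animal care',
          'advanced logistics for aerospace', 'aerox advanced polymers']
w = token_weights(sample)
query = 'toray advanced composites'
rare_match = weighted_token_score(query, 'toray industries', w)
common_match = weighted_token_score(query, 'advanced animal care', w)
print(rare_match > common_match)
```

With this weighting, sharing the rare token "toray" outranks sharing the common token "advanced". A function like weighted_token_score could presumably be wrapped as a custom scorer for rapidfuzz.process functions, but that is an untested assumption.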
