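Using
import re
import pandas as pd
import rapidfuzz

# `extraneous` (defined elsewhere) is a list of regex patterns to strip from the names
choices = [rapidfuzz.utils.default_process(sentence=x) for x in allcrmaccts['Account Name']]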
choices = [re.sub('|'.join(extraneous),'',x) for x in choices]
choices = sorted(choices)
queries = [rapidfuzz.utils.default_process(sentence=x) for x in givenaccts['Account Name']]
queries = [re.sub('|'.join(extraneous),'',x) for x in queries]
queries = sorted(queries)
allcrmsearch = rapidfuzz.process.cdist(choices=choices, queries=queries, workers=-1, scorer=rapidfuzz.fuzz.WRatio)
allcrm=pd.DataFrame(allcrmsearch, columns=choices, index=queries)
yields these results:
allcrm[allcrm.max(axis=1)==100].idxmax(axis=1)
3b the fibreglass 3b spa
3d carbon 3d cad i pvt
3m 3m
5m m m
a p technology 2a s p a divisione f2a
...
z laser optoelektronik gmbh 2 e mechatronic gmbh co kg
zhermack spa 3b spa
zoltek z
zsk stickmaschinen gmbh zsk technical embroidery systems 2 e mechatronic gmbh co kg
zund systemtechnik ag 3s swiss solar systems ag
allcrm.loc['toray advanced composites'].nlargest(10)
cobra advanced composites 92.0
advanced animal care of mount pleasant 85.5
advanced armour engineering optimized armor 85.5
advanced bioenergy of the carolinas abc 85.5
advanced composite structures acs group 85.5
advanced computers and mobiles india private limited 85.5
advanced environmental services carolina air care 85.5
advanced healthcare staffing solutions 85.5
advanced international multitech co dizo bike 85.5
advanced logistics for aerospace ala 85.5
which gives undesired matches.
I made a scorer comparison table to see the similarity values for known matches:
t = []
s = 'toray advanced composites'
# Known matches: every choice containing 'toray', plus one known non-match for contrast
q = [col for col in allcrm.columns if 'toray' in col]
q.append('aerox advanced polymers')
for x in q:
    a = [
        rapidfuzz.fuzz.ratio(s1=s, s2=x, processor=rapidfuzz.utils.default_process),
        rapidfuzz.fuzz.partial_ratio(s1=s, s2=x, processor=rapidfuzz.utils.default_process),
        rapidfuzz.fuzz.token_ratio(s1=s, s2=x, processor=rapidfuzz.utils.default_process),
        # partial_ratio_alignment returns a ScoreAlignment object rather than a plain score
        rapidfuzz.fuzz.partial_ratio_alignment(s1=s, s2=x, processor=rapidfuzz.utils.default_process),
        rapidfuzz.fuzz.partial_token_ratio(s1=s, s2=x, processor=rapidfuzz.utils.default_process),
        rapidfuzz.fuzz.WRatio(s1=s, s2=x, processor=rapidfuzz.utils.default_process),
        rapidfuzz.fuzz.QRatio(s1=s, s2=x, processor=rapidfuzz.utils.default_process)]
    t.append(a)
pd.DataFrame(t,
             columns=['Ratio', 'Partial Ratio', 'Token Ratio', 'Partial Ratio Alignment', 'Partial Token Ratio', 'WRatio', 'QRatio'],
             index=q)
Is there a way to discount the importance of matching common-word tokens from choices? I created a histogram of the most frequently occurring tokens, and matching a common word such as "advanced" should count for less than matching a near-distinct token such as "toray".
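To make the idea concrete, here is a minimal sketch of the token counting and the kind of weighting I have in mind. It is only an illustration, not something rapidfuzz provides. The names `sample_choices`, `token_counts`, and `token_weights` are hypothetical, the sample is a handful of names taken from the output above, and the smoothed inverse-frequency formula is an assumption on my part.

```python
import math
from collections import Counter

# Small stand-in sample taken from the output above; the real input would be
# the full processed `choices` list.
sample_choices = [
    "cobra advanced composites",
    "advanced animal care of mount pleasant",
    "advanced healthcare staffing solutions",
    "advanced logistics for aerospace ala",
    "zhermack spa",
    "3b spa",
]

# Count in how many names each token occurs (a token repeated within one
# name is counted once for that name).
token_counts = Counter(
    token for name in sample_choices for token in set(name.split())
)

# Hypothetical weight: the rarer the token, the larger the weight
# (smoothed inverse document frequency).
n_names = len(sample_choices)
token_weights = {
    token: math.log((1 + n_names) / (1 + count)) + 1
    for token, count in token_counts.items()
}

# Most frequent tokens, i.e. the data behind the histogram.
print(token_counts.most_common(3))

# A common token versus a near-distinct one.
for token in ("advanced", "cobra"):
    print(token, token_counts[token], round(token_weights[token], 3))
```

With weights like these, a match on "advanced" alone would contribute far less than a match on a rare token. What I don't know is how to feed such weights into the scorer used by `cdist`.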