normalize_address_record() raises unparseable address error when using full street directionals #31

Closed
philiporlando opened this issue Jul 31, 2023 · 5 comments


@philiporlando

The example below raises an unparseable address error:

from scourgify import normalize_address_record

address = "38350 40TH ST EAST 100 PALMDALE CA 93552"

normalize_address_record(address)
# scourgify.exceptions.UnParseableAddressError: UNPARSEABLE ADDRESS: Unable to break this address into its component parts, OrderedDict([('address_line_1', '38350 40TH ST EAST 100 PALMDALE CA 93552'), ('address_line_2', None), ('city', None), ('state', None), ('postal_code', None)])

Abbreviating the street directional value (changing EAST to E) avoids this error and produces the expected results:

from scourgify import normalize_address_record

address = "38350 40TH ST E 100 PALMDALE CA 93552"

normalize_address_record(address)
# OrderedDict([('address_line_1', '38350 40TH ST E'), ('address_line_2', 'UNIT 100'), ('city', 'PALMDALE'), ('state', 'CA'), ('postal_code', '93552')])

Is it possible to look into this and ensure that full directional names do not raise unparseable address errors? The USPS prefers abbreviated directionals, but still considers full names acceptable.
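Until full directionals parse cleanly, the abbreviation workaround described above can be applied before calling normalize_address_record(). The following is only a minimal sketch: the helper name `abbreviate_directionals` and the `DIRECTIONALS` mapping are hypothetical and not part of scourgify or usaddress. It is deliberately naive, so it would also rewrite directional words that belong to a street or city name (for example "WEST PALM BEACH"), which may not be what you want.

```python
import re

# Hypothetical mapping of full directional words to their abbreviations.
DIRECTIONALS = {
    "NORTHEAST": "NE", "NORTHWEST": "NW",
    "SOUTHEAST": "SE", "SOUTHWEST": "SW",
    "NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
}

# Match whole words only, case-insensitively.
_DIRECTIONAL_RE = re.compile(
    r"\b(" + "|".join(DIRECTIONALS) + r")\b", re.IGNORECASE
)


def abbreviate_directionals(address: str) -> str:
    """Replace every full directional word with its abbreviation.

    Naive by design: it cannot tell a directional from a street or
    city name that happens to be a directional word.
    """
    return _DIRECTIONAL_RE.sub(
        lambda match: DIRECTIONALS[match.group(1).upper()], address
    )


print(abbreviate_directionals("38350 40TH ST EAST 100 PALMDALE CA 93552"))
# 38350 40TH ST E 100 PALMDALE CA 93552
```

The result could then be passed to normalize_address_record() as in the second example above.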

Please let me know if you have any questions about this. Thank you in advance for your help troubleshooting this!

@zak-flex

I have a similar issue with this address: 1345 Towne Lake Hills South Drive, Woodstock, GA, 30189
This variation is parseable: 1345 Towne Lake Hills S Dr, Woodstock, GA, 30189

@fablet
Member

fablet commented Dec 14, 2023

Unfortunately, this is an issue with the usaddress package. You can check tagging behaviors in their UI: https://parserator.datamade.us/usaddress/
The usaddress.tag results are this:

PARSED TOKENS:    [('38350', 'AddressNumber'), ('40TH', 'StreetName'), ('ST', 'StreetNamePostType'), ('EAST', 'StreetNamePreDirectional'), ('100', 'StreetName'), ('PALMDALE', 'PlaceName'), ('CA', 'StateName'), ('93552', 'ZipCode')]
UNCERTAIN LABEL:  StreetName

You can see usaddress is incorrectly identifying the post-directional as a pre-directional, which is causing it to identify the street name a second time.

Compare with the tagging result for `38350 40TH ST E 100 PALMDALE CA 93552`:

(OrderedDict([('AddressNumber', '38350'),
('StreetName', '40TH'),
('StreetNamePostType', 'ST'),
('StreetNamePostDirectional', 'E'),
('OccupancyIdentifier', '100'),
('PlaceName', 'PALMDALE'),
('StateName', 'CA'),
('ZipCode', '93552')]),
'Street Address')

This issue needs to be resubmitted to that package: https://github.com/datamade/usaddress/issues
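To make the "same label twice" condition concrete, here is a rough sketch that scans a list of (token, label) pairs and reports any label that shows up in more than one separate run. It only approximates the situation reported as UNCERTAIN LABEL above and is not the usaddress implementation; `find_repeated_labels` is a hypothetical helper name. The token lists are copied from the outputs in this comment.

```python
def find_repeated_labels(tokens):
    """Return labels that appear in more than one non-adjacent run."""
    seen = []        # labels already encountered, in order of first use
    repeated = []    # labels that started a second, separate run
    previous = None
    for _token, label in tokens:
        if label == previous:
            continue  # still inside the same run of this label
        if label in seen:
            if label not in repeated:
                repeated.append(label)
        else:
            seen.append(label)
        previous = label
    return repeated


full_directional = [
    ("38350", "AddressNumber"), ("40TH", "StreetName"),
    ("ST", "StreetNamePostType"), ("EAST", "StreetNamePreDirectional"),
    ("100", "StreetName"), ("PALMDALE", "PlaceName"),
    ("CA", "StateName"), ("93552", "ZipCode"),
]
abbreviated = [
    ("38350", "AddressNumber"), ("40TH", "StreetName"),
    ("ST", "StreetNamePostType"), ("E", "StreetNamePostDirectional"),
    ("100", "OccupancyIdentifier"), ("PALMDALE", "PlaceName"),
    ("CA", "StateName"), ("93552", "ZipCode"),
]

print(find_repeated_labels(full_directional))  # ['StreetName']
print(find_repeated_labels(abbreviated))       # []
```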

@philiporlando
Author

philiporlando commented Dec 30, 2023

@fablet, I appreciate your input. Like you, I also encountered the parsing error using the Parserator API at https://parserator.datamade.us/usaddress.

However, I've successfully used usaddress.parse() with the address "38350 40TH ST EAST 100 PALMDALE CA 93552" with usaddress version 0.5.10:

import usaddress

address = "38350 40TH ST EAST 100 PALMDALE CA 93552"

print(usaddress.parse(address))

# [('38350', 'AddressNumber'), ('40TH', 'StreetName'), ('ST', 'StreetNamePostType'), ('EAST', 'StreetNamePreDirectional'), ('100', 'StreetName'), ('PALMDALE', 'PlaceName'), ('CA', 'StateName'), ('93552', 'ZipCode')]

It seems the latest version of usaddress might have resolved this pre- vs. post-directional issue. However, I'm uncertain which usaddress version the Parserator API uses.

Unfortunately, usaddress.tag() now raises a duplicate street name error when using the latest version:

import usaddress

address = "38350 40TH ST EAST 100 PALMDALE CA 93552"

usaddress.tag(address)

# Traceback (most recent call last):
#   File "/home/user/usaddress_parse_error/usaddress_parse_error.py", line 5, in <module>
#     usaddress.tag(address)
#   File "/home/user/.cache/pypoetry/virtualenvs/usaddress-parse-error-aadNbsKj-py3.10/lib/python3.10/site-packages/usaddress/__init__.py", line 177, in tag
#     raise RepeatedLabelError(address_string, parse(address_string),
# usaddress.RepeatedLabelError: 
# ERROR: Unable to tag this string because more than one area of the string has the same label

# ORIGINAL STRING:  38350 40TH ST EAST 100 PALMDALE CA 93552
# PARSED TOKENS:    [('38350', 'AddressNumber'), ('40TH', 'StreetName'), ('ST', 'StreetNamePostType'), ('EAST', 'StreetNamePreDirectional'), ('100', 'StreetName'), ('PALMDALE', 'PlaceName'), ('CA', 'StateName'), ('93552', 'ZipCode')]
# UNCERTAIN LABEL:  StreetName

# When this error is raised, it's likely that either (1) the string is not a valid person/corporation name or (2) some tokens were labeled incorrectly

# To report an error in labeling a valid name, open an issue at https://github.com/datamade/usaddress/issues/new - it'll help us continue to improve probablepeople!

# For more information, see the documentation at https://usaddress.readthedocs.io/

So it seems that we are trading one parsing error for another. That being said, the newest version of usaddress.parse() is working for me, which is the function that I need for my business case.
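For anyone who wants to use tag() when it works and fall back to the raw parse() tokens when it does not, something along these lines might do. This is a sketch under assumptions: `tag_or_parse`, its `usaddress_module` parameter, and the returned dict shape are all hypothetical. The only usaddress names it relies on are the ones shown in this thread: `tag`, `parse`, and `RepeatedLabelError`.

```python
def tag_or_parse(address, usaddress_module=None):
    """Try tag(); if it raises RepeatedLabelError, return parse() tokens.

    usaddress_module is a hypothetical hook that lets a stand-in be
    passed for testing; by default the real package is imported.
    """
    if usaddress_module is None:
        import usaddress as usaddress_module  # imported lazily

    try:
        # tag() returns a (components, address_type) pair.
        components, address_type = usaddress_module.tag(address)
    except usaddress_module.RepeatedLabelError:
        # Fall back to the flat list of (token, label) pairs.
        return {
            "tagged": False,
            "address_type": None,
            "tokens": usaddress_module.parse(address),
        }
    return {
        "tagged": True,
        "address_type": address_type,
        "components": dict(components),
    }
```

With usaddress installed, `tag_or_parse("38350 40TH ST EAST 100 PALMDALE CA 93552")` would be expected to take the fallback branch, given the traceback above. The caller still has to decide what to do with mislabeled tokens.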

Do you know if there are plans to update usaddress-scourgify's dependency on usaddress from 0.5.9 to 0.5.10 in the near future? I'm hoping that this would avoid the error I'm seeing with normalize_address_record().

Thank you again for assisting with this issue.

@philiporlando
Author

philiporlando commented Dec 30, 2023

Ok, I just tried forking this repo and updating its usaddress dependency to 0.5.10.

Unfortunately, this did not resolve my issues:

from scourgify import normalize_address_record

address = "38350 40TH ST EAST 100 PALMDALE CA 93552"

normalize_address_record(address)

# Traceback (most recent call last):
#   File "/home/user/usaddress_parse_error/usaddress_parse_error.py", line 5, in <module>
#     normalize_address_record(address)
#   File "/home/user/.cache/pypoetry/virtualenvs/usaddress-parse-error-aadNbsKj-py3.10/lib/python3.10/site-packages/scourgify/normalize.py", line 159, in normalize_address_record
#     return normalize_addr_str(
#   File "/home/user/.cache/pypoetry/virtualenvs/usaddress-parse-error-aadNbsKj-py3.10/lib/python3.10/site-packages/scourgify/normalize.py", line 267, in normalize_addr_str
#     raise UnParseableAddressError(None, None, addr_rec)
# scourgify.exceptions.UnParseableAddressError: UNPARSEABLE ADDRESS: Unable to break this address into its component parts, OrderedDict([('address_line_1', '38350 40TH ST EAST 100 PALMDALE CA 93552'), ('address_line_2', None), ('city', None), ('state', None), ('postal_code', None)])

It probably makes the most sense to open a new issue within the usaddress repo and try to address the error with usaddress.tag().

@philiporlando
Author

@fablet, I've opened this issue to address the root of the problem. Thanks again for the support.


3 participants