Whitespace validation does not work on columns that only contain numbers #7

Open
wolces opened this issue Jan 12, 2018 · 3 comments

wolces commented Jan 12, 2018

If all entries in a column are numeric, then whitespace validation will not find errors in any entries in that column. If a single entry in a column is non-numeric, then whitespace validation will work on all entries in that column. For example:

import pandas as pd
from io import StringIO
from pandas_schema import Column, Schema
from pandas_schema.validation import LeadingWhitespaceValidation, TrailingWhitespaceValidation

schema = Schema([
    Column('col1', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
    Column('col2', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()])
])

test_data = pd.read_csv(StringIO('''col1,col2
1,3
4,p
 2 ,3
3, 9
1 ,3
6,2 
'''))

errors = schema.validate(test_data)

for error in errors:
    print(error)

prints only the errors for col2; the whitespace in col1 (rows 2 and 4) is not reported:

{row: 3, column: "col2"}: " 9" contains leading whitespace
{row: 5, column: "col2"}: "2 " contains trailing whitespace

multimeric added the bug label Feb 26, 2018
@multimeric
Owner

The best way to help me with this would be to add a test for LeadingWhitespaceValidation or TrailingWhitespaceValidation in test/test_validation.py that currently fails for this example. Then I can very quickly write a fix for it.

@multimeric
Owner

multimeric commented Apr 19, 2018

I've had a look into this, and it's not exactly a bug in PandasSchema. The problem is that pd.read_csv performs automatic type inference: because col1 consists entirely of integers, it is parsed as an integer series, and the surrounding whitespace is lost before any validation runs.
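
The inference described above can be seen by inspecting the parsed values directly. This is a minimal sketch using only pandas and the CSV data from the original report; the variable names are illustrative and not part of PandasSchema.

```python
import pandas as pd
from io import StringIO

# Same CSV as in the original report: col1 holds only integers (some padded
# with spaces), while col2 contains one non-numeric entry ("p").
csv_text = "col1,col2\n1,3\n4,p\n 2 ,3\n3, 9\n1 ,3\n6,2 \n"

# No dtype given, so pandas infers a type for each column.
inferred = pd.read_csv(StringIO(csv_text))

# col1 is parsed as numbers, so the padding around " 2 " and "1 " is gone
# before any validation can look at it.
col1_values = inferred["col1"].tolist()

# col2 cannot be parsed as numbers because of "p", so it stays as text and
# keeps its whitespace.
col2_values = inferred["col2"].tolist()

print(inferred.dtypes)
print(col1_values)
print(col2_values)
```

Because col2 stays as text, its " 9" and "2 " entries are still visible to the whitespace validations, which matches the two errors in the report.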

If you make sure that everything is parsed as a string, by setting the dtype manually, the validations will work as expected:

test_data = pd.read_csv(StringIO('''col1
3
3
 9
3
2
'''), dtype=str)
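
To confirm that dtype=str preserves the whitespace, the column can be checked with plain pandas string methods. This is only a sketch: the names `as_text`, `has_padding` and `padded_rows` are illustrative, and the comparison against `str.strip()` is a stand-in for the whitespace validations, not how PandasSchema implements them.

```python
import pandas as pd
from io import StringIO

# Single-column CSV from the comment above; the third entry has a leading space.
csv_text = "col1\n3\n3\n 9\n3\n2\n"

# dtype=str disables numeric inference, so every cell is kept as written.
as_text = pd.read_csv(StringIO(csv_text), dtype=str)

# Flag cells whose text changes when surrounding whitespace is stripped.
has_padding = as_text["col1"] != as_text["col1"].str.strip()
padded_rows = as_text.index[has_padding].tolist()

print(as_text["col1"].tolist())
print(padded_rows)
```

With the strings intact, the schema from the original report should be able to flag the padded row as well.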

@multimeric
Owner

I'll try to update the documentation to make this clearer.
