Whitespace validation does not work on columns that only contain numbers #7

wolces · 2018-01-12T01:05:18Z

If all entries in a column are numeric, then whitespace validation will not find errors in any entries in that column. If a single entry in a column is non-numeric, then whitespace validation will work on all entries in that column. For example:

import pandas as pd
from io import StringIO
from pandas_schema import Column, Schema
from pandas_schema.validation import LeadingWhitespaceValidation, TrailingWhitespaceValidation

schema = Schema([
    Column('col1', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
    Column('col2', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()])
])

test_data = pd.read_csv(StringIO('''col1,col2
1,3
4,p
 2 ,3
3, 9
1 ,3
6,2 
'''))

errors = schema.validate(test_data)

for error in errors:
    print(error)

prints only the errors for col2, even though col1 also contains whitespace (rows 2 and 4):

{row: 3, column: "col2"}: " 9" contains leading whitespace
{row: 5, column: "col2"}: "2 " contains trailing whitespace


multimeric · 2018-02-26T13:20:38Z

The best way to help me with this would be to add a test for LeadingWhitespaceValidation or TrailingWhitespaceValidation in test/test_validation.py that currently fails for this example. Then I can very quickly write a fix for it.

multimeric · 2018-04-19T08:58:02Z

I've had a look into this, and it's not exactly a bug in PandasSchema. The problem is that pd.read_csv does automatic type conversion: because col1 consists entirely of integers, it is parsed as an integer series, so the whitespace is already gone by the time the validations run.
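To see the conversion on its own, here is a minimal sketch that uses only pandas, with no PandasSchema involved. The CSV text and the variable names are made up for illustration, and the exact integer dtype may vary by platform or pandas version.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: every entry is numeric, but two are padded with spaces.
csv_text = "col1\n3\n 9\n2 \n"

# With default settings, read_csv infers a dtype for each column.
inferred = pd.read_csv(StringIO(csv_text))

# Inspect what the parser produced: the padded entries come back as plain
# integers, so there is no whitespace left for a validation to find.
print(inferred["col1"].dtype)
print(inferred["col1"].tolist())
```

Because the values are already integers when the schema sees them, the whitespace validations have nothing to report.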

If you make sure that everything is parsed as a string, by setting the dtype manually, the validations will work as expected:

test_data = pd.read_csv(StringIO('''col1
3
3
 9
3
2
'''), dtype=str)
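As a quick check that dtype=str really keeps the whitespace, the sketch below reads the same data and flags padded entries with plain pandas string methods. This is only a stand-in for illustration, not how LeadingWhitespaceValidation or TrailingWhitespaceValidation are implemented; the variable names are assumptions.

```python
from io import StringIO

import pandas as pd

# Same data as in the example above, read with every column forced to str.
csv_text = "col1\n3\n3\n 9\n3\n2 \n"
as_text = pd.read_csv(StringIO(csv_text), dtype=str)
col = as_text["col1"]

# An entry has leading (or trailing) whitespace if stripping that side
# changes the string.
has_leading = col != col.str.lstrip()
has_trailing = col != col.str.rstrip()

print(col[has_leading].tolist())
print(col[has_trailing].tolist())
```

If both lists come back non-empty here, the same DataFrame passed to schema.validate should give the whitespace validations something to work with.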

multimeric · 2018-04-19T08:58:55Z

I'll try to update the documentation to make this clearer.

multimeric added the bug label Feb 26, 2018
