[Bug] DBT Seed in dbt-athena-community is not working for .csv with encoding utf-8 #734

Open
juliodias20 opened this issue Oct 11, 2024 · 6 comments
Labels: bug (Something isn't working)

juliodias20 commented Oct 11, 2024

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

I would like to start by apologizing if there is already a bug report on this subject, but I couldn't find it.

So, I have a .csv file (sds_teste.csv) that I am using as a seed, like this:

(screenshot: contents of the seed file)

Then I execute the command dbt seed -s sds_teste and dbt runs successfully:

(screenshot: successful dbt seed run)

But when I execute a select to see the table created by the dbt seed command, I can see that the special characters (accented letters) were not loaded correctly:

(screenshot: query result showing broken characters)

I have already tried some things I found around the internet, like passing encoding: utf-8, but I have not found anything that works.

My profiles.yml:

(screenshot: profiles.yml)

Expected Behavior

The expected behavior is that dbt seed can correctly load a .csv file encoded in UTF-8.

Should be: A text with special characters, like Ã, Á, Í, or Ç
Instead of: A text with special characters, like �, �, �, or �
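
The exact symptom can be reproduced in a few lines of Python. This is only a sketch of the suspected mechanism (my assumption, not confirmed from the dbt code): the seed content gets encoded with the Windows locale encoding (cp1252) and is later decoded as UTF-8, turning each accented letter into the replacement character.

text = "Ã Á Í Ç"
cp1252_bytes = text.encode("cp1252")                   # b'\xc3 \xc1 \xcd \xc7'
# Each single cp1252 byte is an invalid UTF-8 sequence, so it decodes to U+FFFD:
print(cp1252_bytes.decode("utf-8", errors="replace"))  # prints: � � � �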

Steps To Reproduce

1 - Install Python 3.11.9 on a Windows computer
2 - Create a Python environment with python venv
3 - Install dbt-core==1.8.7 and dbt-athena-community==1.8.4 via pip install
4 - Create a dbt project
5 - Create a .csv file in the seeds/ folder and write some example rows with special characters
6 - Configure the profiles.yml to connect to AWS Athena (storage: AWS S3)
7 - Run the dbt seed command

Relevant log output

============================== 13:38:20.928977 | c0ac6503-3146-44d1-84fa-552929776d79 ==============================
13:38:20.928977 [info ] [MainThread]: Running with dbt=1.8.7
13:38:20.929979 [debug] [MainThread]: running dbt with arguments {'printer_width': '80', 'indirect_selection': 'eager', 'write_json': 'True', 'log_cache_events': 'False', 'partial_parse': 'True', 'cache_selected_only': 'False', 'warn_error': 'None', 'fail_fast': 'False', 'profiles_dir': 'C:\\Users\\julio.padoveze\\.dbt', 'log_path': 'C:\\Users\\julio.padoveze\\github\\prd-Copastur-dbt-source-qv\\logs', 'version_check': 'True', 'debug': 'False', 'use_colors': 'True', 'use_experimental_parser': 'False', 'no_print': 'None', 'quiet': 'False', 'empty': 'None', 'log_format': 'default', 'introspect': 'True', 'invocation_command': 'dbt seed -s sds_teste', 'static_parser': 'True', 'target_path': 'None', 'warn_error_options': 'WarnErrorOptions(include=[], exclude=[])', 'send_anonymous_usage_stats': 'True'}
13:38:21.527840 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'project_id', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020521A27B10>]}
13:38:21.561425 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'adapter_info', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x000002051CDCDE10>]}
13:38:21.563666 [info ] [MainThread]: Registered adapter: athena=1.8.4
13:38:21.575796 [debug] [MainThread]: checksum: 4af21dafb485259c48497ac86b711ddb1982f3d0f1c0ca4e09356de488b753c0, vars: {}, profile: , target: , version: 1.8.7
13:38:21.689602 [info ] [MainThread]: Unable to do partial parsing because of a version mismatch
13:38:21.690599 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'partial_parser', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020521C79010>]}
13:38:22.937805 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'load_project', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020522118250>]}
13:38:23.152658 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'resource_counts', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020522129ED0>]}
13:38:23.152658 [info ] [MainThread]: Found 17 models, 9 seeds, 96 sources, 568 macros
13:38:23.153659 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020521F006D0>]}
13:38:23.155683 [info ] [MainThread]: 
13:38:23.157687 [debug] [MainThread]: Acquiring new athena connection 'master'
13:38:23.160683 [debug] [ThreadPool]: Acquiring new athena connection 'list_awsdatacatalog'
13:38:23.161689 [debug] [ThreadPool]: Opening a new connection, currently in state init
13:38:24.834482 [debug] [ThreadPool]: On list_awsdatacatalog: Close
13:38:24.840486 [debug] [ThreadPool]: Acquiring new athena connection 'list_awsdatacatalog_db_dbt_validation'
13:38:24.840486 [debug] [ThreadPool]: Opening a new connection, currently in state init
13:38:29.476313 [debug] [ThreadPool]: On list_awsdatacatalog_db_dbt_validation: Close
13:38:29.495682 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020521B74F10>]}
13:38:29.496682 [info ] [MainThread]: Concurrency: 10 threads (target='dev')
13:38:29.498679 [info ] [MainThread]: 
13:38:29.512664 [debug] [Thread-1 (]: Began running node seed.qv.sds_teste
13:38:29.514172 [info ] [Thread-1 (]: 1 of 1 START seed file db_dbt_validation.db_qvextracao__sds_teste .............. [RUN]
13:38:29.516618 [debug] [Thread-1 (]: Acquiring new athena connection 'seed.qv.sds_teste'
13:38:29.518105 [debug] [Thread-1 (]: Began compiling node seed.qv.sds_teste
13:38:29.519110 [debug] [Thread-1 (]: Began executing node seed.qv.sds_teste
13:38:29.553617 [debug] [Thread-1 (]: seed by upload...
13:38:29.566419 [debug] [Thread-1 (]: Opening a new connection, currently in state init
13:38:32.830681 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Table "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp" does not exists - Ignoring
13:38:32.832678 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Table "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp" does not exist - Ignoring
13:38:32.834710 [debug] [Thread-1 (]: Dropping relation via Glue and S3 APIs
13:38:34.321827 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Table "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp" does not exists - Ignoring
13:38:34.323331 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Table "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp" does not exist - Ignoring
13:38:35.891369 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Table "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp" does not exist and will not be deleted, ignoring
13:38:35.893875 [debug] [Thread-1 (]: Using athena connection "seed.qv.sds_teste"
13:38:35.893875 [debug] [Thread-1 (]: On seed.qv.sds_teste: -- /* {"app": "dbt", "dbt_version": "1.8.7", "profile_name": "qv", "target_name": "dev", "node_id": "seed.qv.sds_teste"} */
create external table `db_dbt_validation`.`db_qvextracao__sds_teste__dbt_tmp` (`id` string, `some_text` string)
    row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    location 's3://prd-lakehouse-copastur-analytics/analytics/env=prod/top=copastur/domain=analytics/dbt/db_dbt_validation/db_qvextracao__sds_teste__dbt_tmp'
    tblproperties (
      'skip.header.line.count'='1'
    )
  
13:38:36.637497 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Athena query ID 87dea510-7789-40ce-849c-4d2a4d6428bf
13:38:38.371586 [debug] [Thread-1 (]: SQL status: OK -1 in 2.477 seconds
13:38:38.386232 [debug] [Thread-1 (]: Skip partitioning: False
13:38:38.387230 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: get_work_group for %s
13:38:39.758786 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: S3 path does not exist
13:38:39.759788 [debug] [Thread-1 (]: Using athena connection "seed.qv.sds_teste"
13:38:39.759788 [debug] [Thread-1 (]: On seed.qv.sds_teste: -- /* {"app": "dbt", "dbt_version": "1.8.7", "profile_name": "qv", "target_name": "dev", "node_id": "seed.qv.sds_teste"} */
create table "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste"
    with (
      table_type='hive',
      is_external=true,
      format='parquet'
    )
    as
      
    select
        
              cast(nullif("id", '') as integer) as "id"
            , 
              cast(nullif("some_text", '') as varchar) as "some_text"
            
    from
        "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp"
  
13:38:39.966615 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Athena query ID 3ff995b0-dbb9-42b2-ab05-90844aeb9925
13:38:42.972477 [debug] [Thread-1 (]: SQL status: OK 2 in 3.212 seconds
13:38:44.460039 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: table_name : 490445445844.db_dbt_validation.db_qvextracao__sds_teste__dbt_tmp
13:38:44.462029 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: table type : TableType.TABLE
13:38:44.462029 [debug] [Thread-1 (]: Dropping relation via Glue and S3 APIs
13:38:45.992330 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: table_name : 490445445844.db_dbt_validation.db_qvextracao__sds_teste__dbt_tmp
13:38:45.993334 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: table type : TableType.TABLE
13:38:45.994409 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp" is stored in s3://prd-lakehouse-copastur-analytics/analytics/env=prod/top=copastur/domain=analytics/dbt/db_dbt_validation/db_qvextracao__sds_teste__dbt_tmp
13:38:46.756532 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Deleting table data: path='s3://prd-lakehouse-copastur-analytics/analytics/env=prod/top=copastur/domain=analytics/dbt/db_dbt_validation/db_qvextracao__sds_teste__dbt_tmp', bucket='prd-lakehouse-copastur-analytics', prefix='analytics/env=prod/top=copastur/domain=analytics/dbt/db_dbt_validation/db_qvextracao__sds_teste__dbt_tmp/'
13:38:49.383198 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Deleted table from glue catalog: "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp"

Environment

- OS: Windows 11
- Python: 3.11.9
- dbt-core: 1.8.7
- dbt-athena-community: 1.8.4

Which database adapter are you using with dbt?

other (mention it in "Additional Context")

Additional Context

No response

juliodias20 added the bug (Something isn't working) and triage labels on Oct 11, 2024
@dbeatty10

Thanks for reaching out @juliodias20 !

Special characters worked for me when I tried with dbt-duckdb, so this might be specific to the dbt-athena-community adapter rather than dbt-core. I'm going to transfer this issue to that repository instead.

Example

See below for my output when using dbt-duckdb.

Create this file:

seeds/my_seed.csv

id,some_text
1,ABC
2,Ã Á Í Ç

Run these commands:

dbt seed
dbt show --inline 'select * from {{ ref("my_seed") }}'

See this output:

| id | some_text |
| -- | --------- |
|  1 | ABC       |
|  2 | Ã Á Í Ç   |

@dbeatty10 dbeatty10 transferred this issue from dbt-labs/dbt-core Oct 12, 2024
@juliodias20 (Author)

Thanks for the collaboration @dbeatty10!

Now that you've said this, I tested the same case with the Databricks adapter and it works correctly! It really sounds like a problem with the Athena adapter.

@e-quili

e-quili commented Oct 14, 2024

Hello,

It works on my side with dbt-athena on Windows.
Could you please try adding this parameter to .vscode/settings.json and opening a new terminal in VS Code?

{
    "terminal.integrated.env.windows": {
        "PYTHONUTF8": "1"
    }
}
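
For context on why this helps (my reading of it, not something stated in the thread): PYTHONUTF8=1 enables Python's UTF-8 mode, so open() calls that do not pass an explicit encoding stop falling back to the Windows locale encoding. A quick way to see the difference, assuming CPython 3.7+ on Windows:

import locale

# Without UTF-8 mode this typically prints 'cp1252' on Windows; with
# PYTHONUTF8=1 it reports the UTF-8 codec, which is what seed files need.
print(locale.getpreferredencoding(False))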

@juliodias20 (Author)

Hello @e-quili, thank you so much for the collaboration! I tested this solution and it works!!

I will use this in my local environment for development, but I still think there is a bug, since other adapters can identify the encoding of the .csv file. What do you think?

@CommonCrisis (Contributor)

CommonCrisis commented Oct 31, 2024

I have a similar problem: I cannot even upload the csv file to S3 when it contains these words:

Sedlišćo pódzajtšo, Dolna Łužyca, etc.

This will not work with the dbt-athena adapter even though the letters are UTF-8.
I checked all possibilities of a misconfigured encoding with:

import os, sys

print(os.environ.get("PYTHONIOENCODING"))  # .get avoids a KeyError if unset
print(sys.getfilesystemencoding())
print(sys.getdefaultencoding())

And they are all set to utf-8.

The issue is within agate's csv_py3.py:

The function writerow() calls self.writer.writerow(row); if I add a try / except there, I get all the rows that cannot be processed.

So here I am actually stuck - these are the rows that cannot be processed, because somehow it always encodes with cp1252, which does not contain these special characters.
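
A standalone sketch of that instrumentation (assumptions on my part: the failure is a UnicodeEncodeError at write time, and seed_check.csv is a made-up output path; this is not the actual agate code):

import csv

rows = [
    ["Sedlitz Ost (Sedlišćo pódzajtšo)"],
    ["Senftenberg (Zły Komorow)"],
]

# encoding="cp1252" mimics the Windows locale default; characters like
# ć and ł have no cp1252 mapping, so writerow raises UnicodeEncodeError.
with open("seed_check.csv", "w", newline="", encoding="cp1252") as f:
    writer = csv.writer(f)
    for row in rows:
        try:
            writer.writerow(row)
        except UnicodeEncodeError as exc:
            print(f"cannot be processed: {row!r} ({exc})")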

Here is some test data:

"Sedlitz Ost (Sedlišćo pódzajtšo)",
"Senftenberg (Zły Komorow)",
"Cottbus Hbf Calau (NL) Chóśebuz gł.dw",
"Gollmitz (NL) Chańc (Dolna Łužyca)",
"Calau (NL) Kalawa (Dolna Łužyca",
"Kolkwitz Süd Gołkojce pódpołdnjo",
"Cottbus-Sandow Chóśebuz-Žandow",
"Cottbus-Merzdorf Chóśebuz-Žylowk",
"Cottbus-Willmersdorf Nord Chóśebuz-Rogoznow pódpołnoc",

One additional note:

If I just read my csv and write it again (as dbt does) it just works:

import csv
import agate

my_seed = r"path/my_csv.csv"

# reading with an explicit utf-8 encoding works fine
with open(my_seed, encoding="utf-8") as f:
    data = list(csv.reader(f))

# and so does an agate round-trip
table = agate.Table.from_csv(my_seed)
table.to_csv("asd.csv")

Maybe someone can point me in the right direction to where the local csv file is actually read in dbt - I cannot find where the agate table is created.


UPDATE:

Okay, it is actually PowerShell causing the issues. If I use Git Bash, it just works fine.
Windows...

@CommonCrisis (Contributor)

Okay, for everyone having issues with seeds and dbt-athena on Windows, set this: $Env:PYTHONUTF8=1
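
To confirm it took effect in the same shell, a minimal check (assuming Python 3.7+, where sys.flags.utf8_mode exists):

import sys

# prints 1 once PYTHONUTF8=1 / UTF-8 mode is enabled, 0 otherwise
print(sys.flags.utf8_mode)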
