Error converting comma-separated values "Katalogwerte" in cleansing bulk #723

@pclasen-eb

Description of the issue

As of March 8th, running the default Mastr download() fails while processing certain files:

  • Error processing file 'Netze.xml': 'could not convert string to float: '334, 335''
  • Error processing file 'EinheitenVerbrennung.xml': 'could not convert string to float: '2442, 2442''

As a result, the corresponding sqlite tables are empty.
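The issue seems to be caused by the dtype check for "O" in replace_mastr_katalogeintraege. This check does not match the pandas string dtype, which is the default for string columns in pandas>=3.0, so the affected columns are skipped and the comma-separated Katalogwerte are later passed to a float conversion.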

If the suggested solution outlined below aligns with your expectations, I would be happy to prepare a pull request.

Please let me know if there are any additional implications or edge cases that I may have overlooked.

Steps to Reproduce

  1. Run the default db.download().
  2. Check whether the grids or combustion_extended tables in the sqlite database are empty.
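
The check in step 2 could be scripted roughly as follows. This is a minimal sketch, not part of the project: find_empty_tables is a hypothetical helper, and the in-memory database only stands in for the real sqlite file, whose path depends on the local setup.

```python
import sqlite3


def find_empty_tables(conn, table_names):
    """Return the names of the given tables that contain no rows.

    Hypothetical helper for illustration; table names are interpolated
    directly, so only pass trusted, known table names.
    """
    empty = []
    for name in table_names:
        # EXISTS returns 1 if the table has at least one row, else 0.
        row = conn.execute(f'SELECT EXISTS (SELECT 1 FROM "{name}")').fetchone()
        if row[0] == 0:
            empty.append(name)
    return empty


# Stand-in database: in practice, connect to the sqlite file written by db.download().
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grids (id INTEGER)")
conn.execute("CREATE TABLE combustion_extended (id INTEGER)")
conn.execute("INSERT INTO combustion_extended VALUES (1)")

demo_result = find_empty_tables(conn, ["grids", "combustion_extended"])
print(demo_result)
```

With the bug present, both grids and combustion_extended would be expected to show up in the returned list.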

Ideas of solution

  • Adjust the if statement in replace_mastr_katalogeintraege (utils_cleansing_bulk.py) to also match the pandas string dtype: instead of if df[column_name].dtype == "O": use if (pd.api.types.is_string_dtype(df[column_name]) or pd.api.types.is_object_dtype(df[column_name])):
  • As the object dtype check is still included, it should work the same as before with older pandas versions.
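
To illustrate why the combined check is needed, here is a small sketch that compares the old comparison against "O" with the proposed one. The helper name is_text_like and the sample columns are assumptions made for this example; the actual code in utils_cleansing_bulk.py may look different.

```python
import pandas as pd


def is_text_like(series: pd.Series) -> bool:
    """Return True for object columns and for dedicated pandas string columns.

    Hypothetical helper mirroring the condition proposed in this issue.
    """
    return pd.api.types.is_string_dtype(series) or pd.api.types.is_object_dtype(series)


# An object column, as older pandas versions infer for text data.
object_col = pd.Series(["334, 335", "2442"], dtype=object)
# A column using the dedicated pandas string dtype, requested explicitly here.
string_col = pd.Series(["334, 335", "2442"], dtype="string")
# A numeric column that must not be treated as text.
numeric_col = pd.Series([334, 335])

# The old check only recognises the object column.
old_check_matches_string = bool(string_col.dtype == "O")

print(is_text_like(object_col), is_text_like(string_col), is_text_like(numeric_col))
print(old_check_matches_string)
```

The string column is requested explicitly so that the sketch behaves the same regardless of which dtype the installed pandas version infers by default.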

Context and Environment

  • Version used: v0.16.1 and the latest commit on develop (297cd59), specifically with pandas 3.0.1
  • Operating system: Unix/macOS
  • Environment setup and (python) version: Python 3.12, pandas 3.0.1

Workflow checklist
