Setting column dtypes | TypeError: float() argument must be a string or a number, not 'Tag' #3

Open · gumdropsteve opened this issue Mar 27, 2021 · 0 comments
Labels: bug, good first issue, help wanted

gumdropsteve (Owner) commented Mar 27, 2021

What?

Issue: casting a column that's supposed to hold floats to float fails.

Error in short...

FileNotFoundError: [Errno 2] No such file or directory: 'data/niseko_hokkaido_prefecture.parquet'

During handling of the above exception, another exception occurred:

TypeError: float() argument must be a string or a number, not 'Tag'

The above exception was the direct cause of the following exception:

ValueError: setting an array element with a sequence.
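For context, the 'Tag' in the message is almost certainly a BeautifulSoup bs4.element.Tag object that ended up in the column instead of its text. A minimal sketch of how that produces this chain of errors (hypothetical data, not the repo's code; exactly how the TypeError chains into the ValueError depends on the numpy/pandas versions):

# minimal repro sketch with made-up data: a bs4 Tag left in a column
# that is then cast to float, as the traceback below suggests happened
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup('<span class="price">100</span>', 'html.parser')
tag = soup.find('span')  # a bs4.element.Tag, not the string '100'

df = pd.DataFrame({'price': [99.0, tag]})  # Tag snuck in instead of tag.text
df['price'] = df['price'].astype('float64')
# TypeError: float() argument must be a string or a number, not 'Tag'
# (numpy/pandas may chain this into the ValueError seen in the full traceback)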

Where?

Specific line

df[column] = df[column].astype(self.dtypes[column])

Surrounding code

airbnb/scrape.py

Lines 347 to 383 in 6c1bd3d

        # add this scrape to the location's existing dataset
        try:
            pd.concat([pd.read_parquet(f'{self.data_dir}{self.location_alias}.parquet'),
                       pd.DataFrame(data, columns=self.names)], axis=0).to_parquet(f'{self.data_dir}{self.location_alias}.parquet', index=False)
        # first time we've scraped this location, make a new dataset
        except:
            # check this is actually new so we don't accidentally overwrite existing data
            # (delete 'y'# from the below line if you want to perform the manual check, otherwise defaults to making a new file)
            i = 'y'  # input(f'recording new location: {self.location_alias}? (y/n) ')
            if i == 'y':
                # make dataframe from scraped data, column names from __init__()
                df = pd.DataFrame(data, columns=self.names)
                # go through each column
                for column in self.dtypes:
                    # our bool data is scraped as True/None, we need True/False
                    if 'bool' in column:
                        # fill None values in bool column with False
                        df[column] = df[column].fillna(False)
                    # convert column to expected dtype for parquet
                    df[column] = df[column].astype(self.dtypes[column])
                # write new parquet file
                df.to_parquet(f'{self.data_dir}{self.location_alias}.parquet', index=False)
                del df  # free up space
            else:
                raise Exception("not recording a new location, what's going on?")

    def scrape_search(self, base_link, search_alias, _filter, n_pages='auto', printout=False):
        """
        record results of a given search link
        """
        # get 1st page
        base_link_page_1, t = get_page(base_link)
        # record the 1st page
        if printout:
            print(self.record_dataset(get_room_classes(base_link_page_1), tos=t, _filter=_filter))
        else:
            self.record_dataset(get_room_classes(base_link_page_1), tos=t, _filter=_filter)
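One way to make the failing astype step more robust (a hypothetical sketch, not the repo's code; get_text() is BeautifulSoup's standard text accessor, and coerce_tags is a made-up helper name) would be to unwrap any stray Tag objects before the cast:

# hypothetical hardening sketch: unwrap stray bs4 Tag values so a changed
# scrape tag yields its text instead of crashing inside astype()
import bs4
import pandas as pd

def coerce_tags(series: pd.Series) -> pd.Series:
    """Replace any bs4.element.Tag values with their stripped text."""
    return series.map(lambda v: v.get_text(strip=True)
                      if isinstance(v, bs4.element.Tag) else v)

# assumed usage inside the loop above:
#     df[column] = coerce_tags(df[column]).astype(self.dtypes[column])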

Solution?

I'm thinking one of the scrape tags got changed in the past few days and needs to be updated. If so, it'll be one of these functions (the data collection / scrape functions).

You could start by looking at the data in the float columns to see which (if any) look weird, or by inspecting the tags on the Airbnb site and comparing them to the tags set for the float columns. A quick diagnostic along those lines is sketched below.
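A sketch of that check (assuming df is the DataFrame built from the scraped data and dtypes is the column-to-dtype mapping from __init__(); both names are assumptions here):

# diagnostic sketch: report which float columns contain bs4 Tag values
# that float() can't handle
import bs4

for column, dtype in dtypes.items():
    if 'float' in str(dtype):
        bad = df[column].map(lambda v: isinstance(v, bs4.element.Tag))
        if bad.any():
            print(f'{column}: {bad.sum()} Tag value(s), '
                  f'e.g. {df.loc[bad, column].iloc[0]!r}')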

Code to reproduce & full error...

Notebook: https://gist.github.com/gumdropsteve/801bad4b453b88b247c7d05b7bd5d488

Full error output:
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-1-fd305a920de5> in record_dataset(self, listings, tos, _filter)
    348         try:
--> 349             pd.concat([pd.read_parquet(f'{self.data_dir}{self.location_alias}.parquet'), 
    350                        pd.DataFrame(data, columns=self.names)], axis=0).to_parquet(f'{self.data_dir}{self.location_alias}.parquet', index=False)

~\Anaconda3\envs\scrape\lib\site-packages\pandas\io\parquet.py in read_parquet(path, engine, columns, use_nullable_dtypes, **kwargs)
    459     return impl.read(
--> 460         path, columns=columns, use_nullable_dtypes=use_nullable_dtypes, **kwargs
    461     )

~\Anaconda3\envs\scrape\lib\site-packages\pandas\io\parquet.py in read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
    217             storage_options=storage_options,
--> 218             mode="rb",
    219         )

~\Anaconda3\envs\scrape\lib\site-packages\pandas\io\parquet.py in _get_path_or_handle(path, fs, storage_options, mode, is_dir)
     81         # this branch is used for example when reading from non-fsspec URLs
---> 82         handles = get_handle(path_or_handle, mode, is_text=False)
     83         fs = None

~\Anaconda3\envs\scrape\lib\site-packages\pandas\io\common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    650             # Binary mode
--> 651             handle = open(handle, ioargs.mode)
    652         handles.append(handle)

FileNotFoundError: [Errno 2] No such file or directory: 'data/niseko_hokkaido_prefecture.parquet'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'Tag'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-1-fd305a920de5> in <module>
    457 
    458     # execute delayed scrapes
--> 459     compute(*collection)
    460 
    461     print(f'runtime: {time.time() - start}')

~\Anaconda3\envs\scrape\lib\site-packages\dask\base.py in compute(*args, **kwargs)
    559         postcomputes.append(x.__dask_postcompute__())
    560 
--> 561     results = schedule(dsk, keys, **kwargs)
    562     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    563 

~\Anaconda3\envs\scrape\lib\site-packages\dask\threaded.py in get(dsk, result, cache, num_workers, pool, **kwargs)
     82         get_id=_thread_get_id,
     83         pack_exception=pack_exception,
---> 84         **kwargs
     85     )
     86 

~\Anaconda3\envs\scrape\lib\site-packages\dask\local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
    485                         _execute_task(task, data)  # Re-execute locally
    486                     else:
--> 487                         raise_exception(exc, tb)
    488                 res, worker_id = loads(res_info)
    489                 state["cache"][key] = res

~\Anaconda3\envs\scrape\lib\site-packages\dask\local.py in reraise(exc, tb)
    315     if exc.__traceback__ is not tb:
    316         raise exc.with_traceback(tb)
--> 317     raise exc
    318 
    319 

~\Anaconda3\envs\scrape\lib\site-packages\dask\local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    220     try:
    221         task, data = loads(task_info)
--> 222         result = _execute_task(task, data)
    223         id = get_id()
    224         result = dumps((result, id))

~\Anaconda3\envs\scrape\lib\site-packages\dask\core.py in _execute_task(arg, cache, dsk)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg

~\Anaconda3\envs\scrape\lib\site-packages\dask\utils.py in apply(func, args, kwargs)
     29 def apply(func, args, kwargs=None):
     30     if kwargs:
---> 31         return func(*args, **kwargs)
     32     else:
     33         return func(*args)

<ipython-input-1-fd305a920de5> in scrape_types(self, printout)
    416         if last_date_recorded != today:
    417             # default search
--> 418             self.scrape_search(self.base_link, f'{self.location_alias}', _filter='', printout=printout)
    419             self.scrape_search(f'{self.base_link}?superhost=true', f'{self.location_alias}_super_hosts', _filter='super_hosts', printout=printout)
    420 

<ipython-input-1-fd305a920de5> in scrape_search(self, base_link, search_alias, _filter, n_pages, printout)
    381             print(self.record_dataset(get_room_classes(base_link_page_1), tos=t, _filter=_filter))
    382         else:
--> 383             self.record_dataset(get_room_classes(base_link_page_1), tos=t, _filter=_filter)
    384 
    385         # get urls for other pages

<ipython-input-1-fd305a920de5> in record_dataset(self, listings, tos, _filter)
    363                         df[column] = df[column].fillna(False)
    364                     # convert column to expected dtype for parquet
--> 365                     df[column] = df[column].astype(self.dtypes[column])
    366                 # write new parquet file
    367                 df.to_parquet(f'{self.data_dir}{self.location_alias}.parquet', index=False)

~\Anaconda3\envs\scrape\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors)
   5872         else:
   5873             # else, only a single dtype is given
-> 5874             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5875             return self._constructor(new_data).__finalize__(self, method="astype")
   5876 

~\Anaconda3\envs\scrape\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, copy, errors)
    629         self, dtype, copy: bool = False, errors: str = "raise"
    630     ) -> "BlockManager":
--> 631         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    632 
    633     def convert(

~\Anaconda3\envs\scrape\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    425                     applied = b.apply(f, **kwargs)
    426                 else:
--> 427                     applied = getattr(b, f)(**kwargs)
    428             except (TypeError, NotImplementedError):
    429                 if not ignore_failures:

~\Anaconda3\envs\scrape\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors)
    671             vals1d = values.ravel()
    672             try:
--> 673                 values = astype_nansafe(vals1d, dtype, copy=True)
    674             except (ValueError, TypeError):
    675                 # e.g. astype_nansafe can fail on object-dtype of strings

~\Anaconda3\envs\scrape\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy, skipna)
   1095     if copy or is_object_dtype(arr) or is_object_dtype(dtype):
   1096         # Explicit copy, or required since NumPy can't view from / to object.
-> 1097         return arr.astype(dtype, copy=True)
   1098 
   1099     return arr.view(dtype)

ValueError: setting an array element with a sequence.

gumdropsteve added the bug, help wanted, and good first issue labels on Mar 27, 2021