Setting column dtypes | TypeError: float() argument must be a string or a number, not 'Tag' #3

Open · gumdropsteve opened this issue Mar 27, 2021 · 0 comments
Labels: bug, good first issue, help wanted

gumdropsteve (Owner) commented Mar 27, 2021

What?

Issue: casting a column that's supposed to hold floats to float fails.

Error in short...

FileNotFoundError: [Errno 2] No such file or directory: 'data/niseko_hokkaido_prefecture.parquet'

During handling of the above exception, another exception occurred:

TypeError: float() argument must be a string or a number, not 'Tag'

The above exception was the direct cause of the following exception:

ValueError: setting an array element with a sequence.
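For context, the 'Tag' in the message is almost certainly a BeautifulSoup bs4.element.Tag object that ended up in the column instead of its text. A minimal sketch of how that produces this chain of errors (hypothetical data, not the repo's code; exactly how the TypeError chains into the ValueError depends on the numpy/pandas versions):

# minimal repro sketch with made-up data: a bs4 Tag left in a column
# that is then cast to float, as the traceback below suggests happened
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup('<span class="price">100</span>', 'html.parser')
tag = soup.find('span')  # a bs4.element.Tag, not the string '100'

df = pd.DataFrame({'price': [99.0, tag]})  # Tag snuck in instead of tag.text
df['price'] = df['price'].astype('float64')
# TypeError: float() argument must be a string or a number, not 'Tag'
# (numpy/pandas may chain this into the ValueError seen in the full traceback)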

Where?

Specific line

df[column] = df[column].astype(self.dtypes[column])

Surrounding code

airbnb/scrape.py

Lines 347 to 383 in 6c1bd3d

        # add this scrape to the location's existing dataset
        try:
            pd.concat([pd.read_parquet(f'{self.data_dir}{self.location_alias}.parquet'),
                       pd.DataFrame(data, columns=self.names)], axis=0).to_parquet(f'{self.data_dir}{self.location_alias}.parquet', index=False)
        # first time we've scraped this location, make a new dataset
        except:
            # check this is actually new so we don't accidentally overwrite existing data
            # (delete 'y'# from the below line if you want to perform the manual check, otherwise defaults to making a new file)
            i = 'y'  # input(f'recording new location: {self.location_alias}? (y/n) ')
            if i == 'y':
                # make dataframe from scraped data, column names from __init__()
                df = pd.DataFrame(data, columns=self.names)
                # go through each column
                for column in self.dtypes:
                    # our bool data is scraped as True/None, we need True/False
                    if 'bool' in column:
                        # fill None values in bool column with False
                        df[column] = df[column].fillna(False)
                    # convert column to expected dtype for parquet
                    df[column] = df[column].astype(self.dtypes[column])
                # write new parquet file
                df.to_parquet(f'{self.data_dir}{self.location_alias}.parquet', index=False)
                del df  # free up space
            else:
                raise Exception("not recording a new location, what's going on?")

    def scrape_search(self, base_link, search_alias, _filter, n_pages='auto', printout=False):
        """
        record results of a given search link
        """
        # get 1st page
        base_link_page_1, t = get_page(base_link)
        # record the 1st page
        if printout:
            print(self.record_dataset(get_room_classes(base_link_page_1), tos=t, _filter=_filter))
        else:
            self.record_dataset(get_room_classes(base_link_page_1), tos=t, _filter=_filter)
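One way to make the failing astype step more robust (a hypothetical sketch, not the repo's code; get_text() is BeautifulSoup's standard text accessor, and coerce_tags is a made-up helper name) would be to unwrap any stray Tag objects before the cast:

# hypothetical hardening sketch: unwrap stray bs4 Tag values so a changed
# scrape tag yields its text instead of crashing inside astype()
import bs4
import pandas as pd

def coerce_tags(series: pd.Series) -> pd.Series:
    """Replace any bs4.element.Tag values with their stripped text."""
    return series.map(lambda v: v.get_text(strip=True)
                      if isinstance(v, bs4.element.Tag) else v)

# assumed usage inside the loop above:
#     df[column] = coerce_tags(df[column]).astype(self.dtypes[column])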

Solution?

I'm thinking one of the scrape tags got changed in the past few days and needs to be updated. If so, it'll be one of these functions (the data collection / scrape functions).

You could start by looking at the data in the float columns to see which (if any) look weird, or by inspecting the tags on the Airbnb site and comparing them to the tags set for the float columns. A quick diagnostic along those lines is sketched below.
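A sketch of that check (assuming df is the DataFrame built from the scraped data and dtypes is the column-to-dtype mapping from __init__(); both names are assumptions here):

# diagnostic sketch: report which float columns contain bs4 Tag values
# that float() can't handle
import bs4

for column, dtype in dtypes.items():
    if 'float' in str(dtype):
        bad = df[column].map(lambda v: isinstance(v, bs4.element.Tag))
        if bad.any():
            print(f'{column}: {bad.sum()} Tag value(s), '
                  f'e.g. {df.loc[bad, column].iloc[0]!r}')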

Code to reproduce & full error...

Notebook: https://gist.github.com/gumdropsteve/801bad4b453b88b247c7d05b7bd5d488

Full error output:
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-1-fd305a920de5> in record_dataset(self, listings, tos, _filter)
    348         try:
--> 349             pd.concat([pd.read_parquet(f'{self.data_dir}{self.location_alias}.parquet'), 
    350                        pd.DataFrame(data, columns=self.names)], axis=0).to_parquet(f'{self.data_dir}{self.location_alias}.parquet', index=False)

~\Anaconda3\envs\scrape\lib\site-packages\pandas\io\parquet.py in read_parquet(path, engine, columns, use_nullable_dtypes, **kwargs)
    459     return impl.read(
--> 460         path, columns=columns, use_nullable_dtypes=use_nullable_dtypes, **kwargs
    461     )

~\Anaconda3\envs\scrape\lib\site-packages\pandas\io\parquet.py in read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
    217             storage_options=storage_options,
--> 218             mode="rb",
    219         )

~\Anaconda3\envs\scrape\lib\site-packages\pandas\io\parquet.py in _get_path_or_handle(path, fs, storage_options, mode, is_dir)
     81         # this branch is used for example when reading from non-fsspec URLs
---> 82         handles = get_handle(path_or_handle, mode, is_text=False)
     83         fs = None

~\Anaconda3\envs\scrape\lib\site-packages\pandas\io\common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    650             # Binary mode
--> 651             handle = open(handle, ioargs.mode)
    652         handles.append(handle)

FileNotFoundError: [Errno 2] No such file or directory: 'data/niseko_hokkaido_prefecture.parquet'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'Tag'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-1-fd305a920de5> in <module>
    457 
    458     # execute delayed scrapes
--> 459     compute(*collection)
    460 
    461     print(f'runtime: {time.time() - start}')

~\Anaconda3\envs\scrape\lib\site-packages\dask\base.py in compute(*args, **kwargs)
    559         postcomputes.append(x.__dask_postcompute__())
    560 
--> 561     results = schedule(dsk, keys, **kwargs)
    562     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    563 

~\Anaconda3\envs\scrape\lib\site-packages\dask\threaded.py in get(dsk, result, cache, num_workers, pool, **kwargs)
     82         get_id=_thread_get_id,
     83         pack_exception=pack_exception,
---> 84         **kwargs
     85     )
     86 

~\Anaconda3\envs\scrape\lib\site-packages\dask\local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
    485                         _execute_task(task, data)  # Re-execute locally
    486                     else:
--> 487                         raise_exception(exc, tb)
    488                 res, worker_id = loads(res_info)
    489                 state["cache"][key] = res

~\Anaconda3\envs\scrape\lib\site-packages\dask\local.py in reraise(exc, tb)
    315     if exc.__traceback__ is not tb:
    316         raise exc.with_traceback(tb)
--> 317     raise exc
    318 
    319 

~\Anaconda3\envs\scrape\lib\site-packages\dask\local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    220     try:
    221         task, data = loads(task_info)
--> 222         result = _execute_task(task, data)
    223         id = get_id()
    224         result = dumps((result, id))

~\Anaconda3\envs\scrape\lib\site-packages\dask\core.py in _execute_task(arg, cache, dsk)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg

~\Anaconda3\envs\scrape\lib\site-packages\dask\utils.py in apply(func, args, kwargs)
     29 def apply(func, args, kwargs=None):
     30     if kwargs:
---> 31         return func(*args, **kwargs)
     32     else:
     33         return func(*args)

<ipython-input-1-fd305a920de5> in scrape_types(self, printout)
    416         if last_date_recorded != today:
    417             # default search
--> 418             self.scrape_search(self.base_link, f'{self.location_alias}', _filter='', printout=printout)
    419             self.scrape_search(f'{self.base_link}?superhost=true', f'{self.location_alias}_super_hosts', _filter='super_hosts', printout=printout)
    420 

<ipython-input-1-fd305a920de5> in scrape_search(self, base_link, search_alias, _filter, n_pages, printout)
    381             print(self.record_dataset(get_room_classes(base_link_page_1), tos=t, _filter=_filter))
    382         else:
--> 383             self.record_dataset(get_room_classes(base_link_page_1), tos=t, _filter=_filter)
    384 
    385         # get urls for other pages

<ipython-input-1-fd305a920de5> in record_dataset(self, listings, tos, _filter)
    363                         df[column] = df[column].fillna(False)
    364                     # convert column to expected dtype for parquet
--> 365                     df[column] = df[column].astype(self.dtypes[column])
    366                 # write new parquet file
    367                 df.to_parquet(f'{self.data_dir}{self.location_alias}.parquet', index=False)

~\Anaconda3\envs\scrape\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors)
   5872         else:
   5873             # else, only a single dtype is given
-> 5874             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5875             return self._constructor(new_data).__finalize__(self, method="astype")
   5876 

~\Anaconda3\envs\scrape\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, copy, errors)
    629         self, dtype, copy: bool = False, errors: str = "raise"
    630     ) -> "BlockManager":
--> 631         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    632 
    633     def convert(

~\Anaconda3\envs\scrape\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    425                     applied = b.apply(f, **kwargs)
    426                 else:
--> 427                     applied = getattr(b, f)(**kwargs)
    428             except (TypeError, NotImplementedError):
    429                 if not ignore_failures:

~\Anaconda3\envs\scrape\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors)
    671             vals1d = values.ravel()
    672             try:
--> 673                 values = astype_nansafe(vals1d, dtype, copy=True)
    674             except (ValueError, TypeError):
    675                 # e.g. astype_nansafe can fail on object-dtype of strings

~\Anaconda3\envs\scrape\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy, skipna)
   1095     if copy or is_object_dtype(arr) or is_object_dtype(dtype):
   1096         # Explicit copy, or required since NumPy can't view from / to object.
-> 1097         return arr.astype(dtype, copy=True)
   1098 
   1099     return arr.view(dtype)

ValueError: setting an array element with a sequence.

gumdropsteve added the bug, help wanted, and good first issue labels on Mar 27, 2021