read_html_live sessions do not close and accumulate causing memory to crash #422

Closed
alireza5969 opened this issue Aug 27, 2024 · 7 comments · Fixed by #429
Labels
bug (an unexpected problem or unintended behavior), live 🐤

Comments

@alireza5969

alireza5969 commented Aug 27, 2024

Dear {{tidyverse}} / {{rvest}} community,

I'm not sure whether this is a bug or a problem that I simply cannot find the solution to.

I am trying to read about 1000 pages with read_html_live() in a for loop. Naturally, I expect each page / session (I'm sorry if I'm not using the correct technical term) to be closed when a new one is opened. However, after the machine has read 50-100 pages, memory is exhausted and the process crashes.

When I look at Task Manager, I see Chrome consuming a large share of the memory (see image below).
image

FYI, this is the code that I'm using:

for (i in strt_n:nrow(list)) {
    print(i)
    
    page <- NA
    attempts <- 0
    
    while (!is.environment(page)) {
        
        attempts <- attempts + 1
        page <- tryCatch({
            read_html_live(list[[i, "url"]])
        },
        error = function(e) {
            print("error")
            if (attempts %% 10 == 0) {beepr::beep()}
            Sys.sleep(3)
            return(NA)
        })
    }
    
    
    Sys.sleep(2)
    
    content <- page %>% 
        html_elements(".oneLineText__oneLineText____Igu4") %>% 
        html_text2()
}

Currently, my workaround is the following code, which runs at the end of every 100th iteration. However, it makes the script very slow.

if (i %% 100 == 0) {
  system("taskkill /IM chrome.exe /F")
}
@hadley
Member

hadley commented Oct 21, 2024

Do you see the problem with this simpler reprex?

library(rvest)

for (i in 1:100) {  
  page <- read_html_live("https://hadley.nz")
}

Does adding an explicit gc() at the end of each iteration make any difference?

hadley added the bug and live 🐤 labels on Oct 21, 2024
@alireza5969
Author

alireza5969 commented Oct 23, 2024

Thank you for your response.
Yes, the issue is reproducible with the code you provided.
No, gc() did not help with the issue.
Here is a screenshot of the Windows Task Manager after 100 iterations of the loop:
image

sessionInfo()
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Asia/Tehran
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rvest_1.0.4

loaded via a namespace (and not attached):
 [1] later_1.3.2       R6_2.5.1          fastmap_1.2.0     websocket_1.4.1   magrittr_2.0.3    glue_1.7.0        lifecycle_1.0.4   ps_1.7.6          xml2_1.3.6       
[10] promises_1.3.0    cli_3.6.3         processx_3.8.4    chromote_0.3.1    compiler_4.3.3    httr_1.4.7        rstudioapi_0.16.0 tools_4.3.3       Rcpp_1.0.12      
[19] rlang_1.1.4       jsonlite_1.8.8   

@hadley
Member

hadley commented Oct 24, 2024

I bet this is going to be a Windows-specific problem 😞

@wch
Member

wch commented Oct 24, 2024

@alireza5969 In the latest screenshot, I see it says "Google Chrome (86)". I don't have a Windows machine handy -- does that mean there are 86 tabs/windows open? It may be counting your regular (visible) tabs in that number.

Also, I think 1.7GB is not actually a lot of memory for Chrome to consume when you have multiple tabs open.

If that 86 does represent the number of open tabs and you do not have that many visible open tabs, then it may be the case that rvest is opening many tabs and not closing them right away.

Can you check if that number is the number of open tabs -- does it increase when you open a new tab? And also check how many visible tabs you have, as opposed to the invisible headless ones created by rvest/chromote.

@hadley
Member

hadley commented Oct 24, 2024

@alireza5969 could you please try installing pak::pak("r-lib/rvest#429"), restarting R, and then seeing if the problem goes away?

@alireza5969
Author

@wch

> In the latest screenshot, it shows "Google Chrome (86)." I don’t have access to a Windows machine right now—does this mean there are 86 tabs/windows open? It might be counting your regular (visible) tabs in that figure.

To be honest, I’m not entirely sure what this indicates! After a fresh session following a restart, when I visit this page with Chrome, I see varying counts (like 14 or 22). When I open a new tab (for instance, google.com), the number jumps to between 19 and 27. So, I suspect it’s not accurately reflecting active or visible tabs.

> Also, I think that 1.7GB isn’t a lot of memory usage for Chrome with multiple tabs open.

I agree, it’s not. However, it’s currently using 73% of my memory (and I actually have decent RAM!). But for my tasks, I sometimes need to scrape over 5K webpages! That’s when it really becomes a concern.

> rvest is opening multiple tabs and not closing them immediately.

Yes, I believe that’s the case.

> Does the memory usage increase when you open a new tab?

Yes, it goes up with the number of open tabs (or potentially with the workload).

All the examples above were with Chrome, without using rvest.


I'm sorry, @hadley! I can’t install pak::pak("r-lib/rvest#429") because I’m encountering the following error:

Error:                                     
! error in pak subprocess
Caused by error: 
! Could not solve package dependencies:
* r-lib/rvest#429: ! pkgdepends resolution error for
r-lib/rvest#429.

However, I was able to install it with this one instead: pak::pak("tidyverse/rvest#429")

This is what it looks like when I run:

for (i in 1:100) {  
  print(i)
  page <- read_html_live("https://hadley.nz")
}

image

I think you did it @hadley 👏🏻😌
Thanks a lot!

@hadley
Member

hadley commented Oct 25, 2024

Oops, sorry for the wrong org name, and thanks for verifying that the fix works!
