SetUrl() does not update new URLs properly within same Session() object #1018

Open
kareemrt opened this issue Jan 19, 2024 · 1 comment

@kareemrt

Description

I am writing a web-scraper library with libcpr that cycles random proxies and headers on each GET request. The intended behavior is for other programs to call force_connect() to perform different GET requests while some state persists between function calls (e.g., proxy and browser header variables).

If I perform a single GET request to a URL (e.g., URL 1), everything works correctly.
If I perform multiple GET requests to the same URL (URL 1), everything works correctly.
If I perform a single GET request to a URL (URL 1) and then perform another GET request to a new URL (e.g., URL 2), the second Session.Get() call returns a response from URL 1 instead of URL 2.

This behavior can be verified with the last line of code (cout << url << " " << r.url << endl;). This prints both the url passed to the function, and the url used in the GET request.

This behavior persists whether I reuse a session object, create a new session object (i.e., remove 'static'), or omit the session and call cpr::Get() directly (though under the hood these seem similar, since cpr::Get() goes through a Session).

My program uses many static variables because I want to keep allocated memory between force_connect() calls; even if I remove the static qualifiers and re-declare the variables, I encounter the same issue.

Several lines of code are commented out; these are potential solutions I tried without success.

I am unsure why I am encountering this behavior; when I print 'session.GetFullRequestUrl()', it prints the PROPER url (URL 2) which is even stranger (it means part of the session object is updating and part of it is not).

Example/How to Reproduce

string force_connect(string url, int tries) {

    // ... (IP, header vars, objects defined, other extraneous code here)
    static thread_local string proxy = "socks5h://" + info.creds + '@' + IP + ":1080";

    // Initialize Session object, set URL
    static thread_local cpr::Session session;
    // session.SetUrl(url);      // Commented out: passing a string url instead of cpr::Url to SetUrl()
    // session.SetOption(url);
    static thread_local cpr::Url CURL;
    CURL = cpr::Url{url};
    session.SetUrl(CURL);

    // Assign proxy and headers to the Session object
    static thread_local cpr::Header header;
    header = cpr::Header{{"user-agent", hdr}};
    session.SetProxies({{"http", proxy}, {"https", proxy}});
    session.SetHeader(header);

    // Perform GET request
    // cout << session.GetFullRequestUrl() << endl; // This updates properly and prints the PROPER url (i.e. URL 2)
    // static thread_local cpr::Response r = cpr::Get(cpr::Url{url}, header, cpr::Proxies{{"http", proxy}, {"https", proxy}});
    static thread_local cpr::Response r = session.Get();
    cout << url << " " << r.url << endl; // url SHOULD equal r.url, but r.url does not update (i.e., stays URL 1)

    return r.text; // assumption: the original snippet omits its return statement
}
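
One detail of the snippet above that may be worth ruling out is how C++ treats the initializer of a function-local `static thread_local` variable: it runs only the first time a given thread reaches the declaration, and later calls reuse the stored value. This is a general language rule, not a confirmed diagnosis of this issue (the report says the behavior also occurs without `static`). The sketch below is self-contained and does not use cpr; `fake_get` and both `fetch_*` functions are hypothetical names introduced only for illustration.

```cpp
#include <string>

// Hypothetical stand-in for a request: echoes the URL it was asked to fetch.
std::string fake_get(const std::string& url) {
    return "response for " + url;
}

// Same shape as `static thread_local cpr::Response r = session.Get();`.
// The initializer runs only on the first call in each thread, so later
// calls return the value computed for the first URL.
std::string fetch_initialized_once(const std::string& url) {
    static thread_local std::string r = fake_get(url);
    return r;
}

// Declaration and assignment are split: the storage is still reused
// between calls, but the value is refreshed on every call.
std::string fetch_assigned_each_call(const std::string& url) {
    static thread_local std::string r;
    r = fake_get(url);
    return r;
}
```

The snippet above already uses the split form for `CURL` and `header`, which would be consistent with those values updating while `r` does not.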

Possible Fix

cpr::Session::SetUrl(const Url& url); takes a passed cpr::Url object and sets the private parameter 'url_' to the reference.

It sets correctly initially (that's how it reaches URL 1), but does not update when the same object (or an entirely new one) is passed again. Even when a new session and/or cpr::Url object is created, I still encounter this behavior.

Looking into the Session::Get() code, the underlying call appears to be curl_easy_perform(), which reads the URL from a libcurl option (curl_easy_setopt(curl, CURLOPT_URL, url_.c_str())) that was set in Session::prepareCommon().

I don't know why Session.url_ is not updating; maybe it is, and something is wrong in libcurl's code (I can't check with a debugger because this library is meant for my main program, which is written in Python, and the class member is private).

Either a modification-check or a copy-by-value approach could be potential solutions.

Where did you get it from?

Other (specify in "Additional Context/Your Environment")

Additional Context/Your Environment

  • OS: MacOS Ventura 13.1
  • Version: 1.10.5
  • Package-Manager / installation method: Homebrew
@COM8
Member

COM8 commented Jan 20, 2024

@kareemrt thanks for reporting!
Based on a quick test, this looks like a multithreading issue. Perhaps not everything is declared thread local, or it might be an issue with how we create curl objects. Could you compare the pointers returned by session.GetCurlHolder() across threads to check whether they are actually different?
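
A minimal sketch of that kind of pointer comparison, assuming a stand-in type instead of cpr::Session so that it compiles on its own. `FakeSession`, `session_address`, and `session_address_in_new_thread` are hypothetical names; in the real program one would presumably log the pointer obtained from session.GetCurlHolder() in each thread instead.

```cpp
#include <cstdint>
#include <thread>

// Hypothetical stand-in for a per-thread session object.
struct FakeSession {
    int handle = 0;
};

// Returns the address of the calling thread's own instance.
std::uintptr_t session_address() {
    static thread_local FakeSession session;
    return reinterpret_cast<std::uintptr_t>(&session);
}

// Returns the address observed from a freshly spawned thread.
std::uintptr_t session_address_in_new_thread() {
    std::uintptr_t addr = 0;
    std::thread worker([&addr] { addr = session_address(); });
    worker.join();
    return addr;
}
```

If the object really is thread local, repeated calls on one thread should report the same address, while a second thread running at the same time should report a different one.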

For reference, the following works in a single-threaded scenario:

TEST(SessionGetTests, GetMultipleTimes1) {
    Url url{server->GetBaseUrl() + "/hello.html"};
    Session session;
    session.SetUrl(url);
    std::string expected_text{"Hello world!"};

    Response response = session.Get();
    EXPECT_EQ(expected_text, response.text);
    EXPECT_EQ(url, response.url);
    EXPECT_EQ(std::string{"text/html"}, response.header["content-type"]);
    EXPECT_EQ(200, response.status_code);
    EXPECT_EQ(ErrorCode::OK, response.error.code);

    Url url2{server->GetBaseUrl() + "/url_post.html"};
    session.SetUrl(url2);
    session.SetPayload({{"x", "5"}});
    std::string expected_text2{
            "{\n"
            "  \"x\": 5\n"
            "}"};

    response = session.Post();
    EXPECT_EQ(expected_text2, response.text);
    EXPECT_EQ(url2, response.url);
    EXPECT_EQ(std::string{"application/json"}, response.header["content-type"]);
    EXPECT_EQ(201, response.status_code);
    EXPECT_EQ(ErrorCode::OK, response.error.code);
}
