Scripting browser-like tasks
curl can do almost every HTTP operation and transfer your favorite browser can. It can actually do a lot more than that as well, but in this chapter we focus on the fact that you can use curl to reproduce, or script, what you would otherwise have to do manually with a browser.
Here are some tricks and advice on how to proceed when doing this.
Figure out what the browser does
This is really a necessary first step. Second-guessing what it does risks having you chase the wrong problem down a rat-hole. The scientific approach to this problem pretty much requires that you first understand what the browser does.
To learn what the browser does to perform a certain task, you can read the HTML pages that you operate on. With a deep enough knowledge, you can see what a browser would do to accomplish the task and then start trying to do the same with curl.
The slightly more effective way, which also works when the page is chock-full of obfuscated JavaScript, is to run the browser and monitor what HTTP operations it performs.
The Copy as curl section describes how you can record a browser's request and easily convert that to a curl command line.
Those copied curl command lines are often not good enough though, since they tend to copy exactly that request, while you probably want to be a tad bit more dynamic so that you can reproduce the same operation and not just resend the verbatim request.
Cookies
A lot of the web today works with a username and password login prompt somewhere. In many cases you even logged in a while ago with your browser but it has kept the state and keeps you logged in.
The logged-in state is almost always done by using cookies. A common operation would be to first log in and save the returned cookies in a file, and then let the site update the cookies in the subsequent command lines when you traverse the site with curl.
Web logins and sessions
The site at https://example.com/ features a login prompt. The login on the website is an HTML form to which you send an HTTP POST. Save the response cookies and the response (HTML) output.
Although the login page is visible (if you would use a browser) on https://example.com/, the HTML form tag on that page informs you about which exact URL to send the POST to, using the action attribute.
In our imaginary case, the form tag looks like this:
<form action="login.cgi" method="POST">
  <input type="text" name="user">
  <input type="password" name="secret">
  <input type="hidden" name="id" value="bc76">
</form>
There are three fields of importance: user, secret and id. The last one, the id, is marked hidden, which means that it does not show up in the browser and it is not a field that a user fills in. It is generated by the site itself, and for your curl login to succeed, you need to extract that value and use it in your POST submission together with the rest of the data.
Send the correct contents for the fields to the correct destination URL:
curl -d user=daniel -d secret=qwerty -d id=bc76 \
  https://example.com/login.cgi -o out
Many login pages even send you a session cookie already when presenting the login, and since you often need to extract the hidden fields from the <form> tag anyway, you could do something like this first:
curl -c cookies https://example.com/ -o loginform
You would often need an HTML parser or some scripting language to extract the id field from there. You can then proceed and log in as mentioned above, but with the added cookie loading (I am splitting the line into two lines to make it more readable):
curl -d user=daniel -d secret=qwerty -d id=bc76 \
  https://example.com/login.cgi -b cookies -c cookies -o out
You can see that it uses both -b for reading cookies from the file and -c to store cookies again, for when the server sends back updated cookies.
Always, always, add -v to the command lines when working out the details. See also the verbose section for more details on that.
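The extraction step mentioned above can be sketched in plain shell. This is a minimal sketch, not a real HTML parser: the helper name extract_hidden is made up for this example, and it assumes the input tag sits on one line, uses double quotes, and has name= before value=, as in the imaginary form. A real site may well need a proper HTML parser instead.

```shell
#!/bin/sh
# Hypothetical helper (not part of curl): print the value of the input
# field named $2 found in the HTML file $1.
# Assumptions: the <input> tag is on a single line, attributes use double
# quotes, and name= appears before value= inside the tag.
extract_hidden() {
  sed -n 's/.*<input[^>]*name="'"$2"'"[^>]*value="\([^"]*\)".*/\1/p' "$1" |
    head -n 1
}

# Possible use together with the commands from this section:
#   curl -c cookies https://example.com/ -o loginform
#   id=$(extract_hidden loginform id)
#   curl -d user=daniel -d secret=qwerty -d "id=$id" \
#     https://example.com/login.cgi -b cookies -c cookies -o out
```

If the helper prints nothing, the form did not match the assumptions above, and the script should stop rather than post an empty id.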
Redirects
It is common for servers to use redirects when responding to a login POST. It is so common I would probably say it is rare that it is not solved with a redirect.
You then just need to remember that curl does not follow redirects automatically. You need to instruct it to do this by adding the -L command line option. Adding that to the previous command line then makes the full one look like:
curl -d user=daniel -d secret=qwerty -d id=bc76 \
  https://example.com/login.cgi -b cookies -c cookies -L -o out
Post-login
In the above example command lines, we save the login response output in a file named 'out'. In your script you should probably verify that it contains some text that confirms that the login was successful.
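Such a check could look roughly like the sketch below. The helper name login_ok and the marker text "Welcome, daniel" are assumptions made for illustration; use a string that your site actually shows only to logged-in users.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file $1 contains the fixed
# string $2 (no regular expression matching).
login_ok() {
  grep -qF -- "$2" "$1"
}

# Possible use right after the login command has written 'out'.
# "Welcome, daniel" is an assumed marker, not something example.com sends.
#   if login_ok out "Welcome, daniel"; then
#     echo "login succeeded"
#   else
#     echo "login failed" >&2
#     exit 1
#   fi
```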
Once successfully logged in, get the files or perform the HTTP operations you need and remember to keep using both -b and -c on the command lines to use and update the cookies.
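One way to avoid forgetting the cookie options is to wrap them in a small function. This is only a sketch: the function name fetch, the SITE and JAR variables, and the report.html path in the usage comment are all made up for this example.

```shell
#!/bin/sh
# Placeholders taken from the examples in this chapter.
SITE="https://example.com"
JAR="cookies"

# Hypothetical wrapper: fetch $SITE/$1 into the file $2, always reading
# cookies from the jar (-b) and writing updated cookies back to it (-c).
fetch() {
  curl -sS -L -b "$JAR" -c "$JAR" -o "$2" "$SITE/$1"
}

# Possible use after a successful login (report.html is an assumed path):
#   fetch report.html report.html
```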
Referer
Some sites verify that the Referer: header actually identifies the legitimate parent URL when you request something, log in or similar. You can then inform the server from which URL you arrived by using -e https://example.com/ etc. Appending that to the previous login attempt then makes it:
curl -d user=daniel -d secret=qwerty -d id=bc76 \
  https://example.com/login.cgi \
  -b cookies -c cookies -L -e "https://example.com/" -o out
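To be a bit more dynamic than a fixed command line, the login could be wrapped in a function that takes the values as arguments. The sketch below is one possible shape, not the only one: the function name login and the SITE and JAR variables are made up, and it uses --data-urlencode instead of -d so that values containing spaces or & are encoded correctly.

```shell
#!/bin/sh
# Placeholders taken from the examples in this chapter.
SITE="https://example.com"
JAR="cookies"

# Hypothetical wrapper around the login command line shown above.
# $1 = user name, $2 = password, $3 = hidden id value taken from the form.
login() {
  curl -sS -L \
    -b "$JAR" -c "$JAR" \
    -e "$SITE/" \
    --data-urlencode "user=$1" \
    --data-urlencode "secret=$2" \
    --data-urlencode "id=$3" \
    -o out \
    "$SITE/login.cgi"
}

# Possible use:
#   login daniel qwerty bc76
```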
TLS fingerprinting
Anti-bot detections nowadays use TLS fingerprinting to figure out whether a request is coming from a browser. curl's fingerprint can vary depending on your environment and is most likely different from those of browsers. The curl command line tool does not have options to change all the various parts of the fingerprint. However, an advanced user can customize the fingerprint through the use of libcurl and by compiling curl from source themselves.
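For illustration, here are a few of the command line options that do influence parts of the TLS handshake. This is a sketch under several assumptions: the wrapper name fetch_tuned is made up, the cipher and curve names use OpenSSL syntax, and whether each option works depends on the TLS library and HTTP/2 support your curl was built with (curl -V shows that). The chosen values are arbitrary examples and do not make curl look like any particular browser.

```shell
#!/bin/sh
# Hypothetical wrapper: run curl with a few TLS-related options pinned.
# All values are example choices in OpenSSL naming; other TLS backends
# may use different names or not support an option at all.
fetch_tuned() {
  curl -sS \
    --tlsv1.2 --tls-max 1.3 \
    --ciphers 'ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256' \
    --tls13-ciphers 'TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384' \
    --curves 'X25519:P-256' \
    --http2 \
    "$@"
}

# Possible use:
#   curl -V | head -n 1        # check which TLS library this build uses
#   fetch_tuned https://example.com/ -o page
```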