2
\$\begingroup\$

An HTTP request is made and a string of JSON objects is returned, which needs to be parsed.
Example response:

{"urlkey": "com,practicingruby)/", "timestamp": "20150420004437", "status": "200", "url": "https://practicingruby.com/", "filename": "common-crawl/crawl-data/CC-MAIN-2015-18/segments/1429246644200.21/warc/CC-MAIN-20150417045724-00242-ip-10-235-10-82.ec2.internal.warc.gz", "length": "9219", "mime": "text/html", "offset": "986953615", "digest": "DOGJXRGCHRUNDTKKJMLYW2UY2BSWCSHX"}
{"urlkey": "com,practicingruby)/", "timestamp": "20150425001851", "status": "200", "url": "https://practicingruby.com/", "filename": "common-crawl/crawl-data/CC-MAIN-2015-18/segments/1429246645538.5/warc/CC-MAIN-20150417045725-00242-ip-10-235-10-82.ec2.internal.warc.gz", "length": "9218", "mime": "text/html", "offset": "935932558", "digest": "LJKP47MYZ2KEEAYWZ4HICSVIHDG7CARQ"}
{"urlkey": "com,practicingruby)/articles/ant-colony-simulation?u=5c7a967f21", "timestamp": "20150421081357", "status": "200", "url": "https://practicingruby.com/articles/ant-colony-simulation?u=5c7a967f21", "filename": "common-crawl/crawl-data/CC-MAIN-2015-18/segments/1429246641054.14/warc/CC-MAIN-20150417045721-00029-ip-10-235-10-82.ec2.internal.warc.gz", "length": "10013", "mime": "text/html", "offset": "966385301", "digest": "AWIR7EJQJCGJYUBWCQBC5UFHCJ2ZNWPQ"}

My code:

result = Net::HTTP.get(URI("http://index.commoncrawl.org/CC-MAIN-2015-18-index?url=#{url}&output=json")).split("}")

result.each do |res|
    break if res == "\n"
    # need to add back braces because we used it to split the various json hashes from the http request
    res << "}"
    to_crawl = JSON.parse(res)
    puts to_crawl
end

It works, but I'm sure there is a much better way to do it, or at least a better way to write the code.

asked Jun 30, 2015 at 18:41 by Wenqin Ye
\$\endgroup\$

2 Answers

3
\$\begingroup\$

This body.split("}") is doing you a disservice, as it destroys the structure of the response. Each JSON object sits on its own line, so split the body by lines instead:

body = Net::HTTP.get(...)
data = body.lines.map { |line| JSON.parse(line) }
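As a minimal sketch of the line-based approach, the snippet below parses a hard-coded sample body instead of a live response, so it runs without network access. The helper name `parse_json_lines` and the abbreviated sample records are assumptions for illustration. The second record's `url` value is made up to contain a `}`, a case that breaks `split("}")` but not line-based parsing. Skipping blank lines is an added precaution, not something the index is known to require.

```ruby
require 'json'

# Hypothetical helper: parse a body that holds one JSON object per line.
def parse_json_lines(body)
  body.each_line
      .map(&:strip)
      .reject(&:empty?)                 # tolerate blank or trailing lines
      .map { |line| JSON.parse(line) }
end

# Abbreviated sample modelled on the response in the question.
# The second url is invented to show a "}" inside a string value.
sample_body = <<~BODY
  {"urlkey": "com,practicingruby)/", "timestamp": "20150420004437", "status": "200"}
  {"urlkey": "com,example)/a}b", "timestamp": "20150425001851", "url": "https://example.com/a}b"}

BODY

records = parse_json_lines(sample_body)
records.each { |record| puts record["timestamp"] }
```

Each element of `records` is a plain Hash, so fields such as `record["filename"]` or `record["offset"]` can be read directly when they are present.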
answered Jun 30, 2015 at 19:01 by tokland
\$\endgroup\$
2
\$\begingroup\$

Use faraday:

require 'faraday'
require 'json'

conn = Faraday.new("http://index.commoncrawl.org/") do |faraday|
  faraday.request :url_encoded             # form-encode POST params
  faraday.adapter Faraday.default_adapter  # make requests with Net::HTTP
end

response = conn.get("/CC-MAIN-2015-18-index?url=#{url}&output=json")
# the body holds one JSON object per line, so parse it line by line
parsed = response.body.lines.map { |line| JSON.parse(line) }
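One detail that applies to either HTTP client: `url` is interpolated into the query string unescaped, so characters such as `&` or a space would corrupt the request. A minimal sketch of building the query with the standard library, assuming a hypothetical helper name `index_uri` and the same index endpoint as in the question:

```ruby
require 'uri'

# Hypothetical helper: build the index query URI with escaped parameters.
def index_uri(url)
  uri = URI("http://index.commoncrawl.org/CC-MAIN-2015-18-index")
  # encode_www_form escapes each value and joins the pairs with "&"
  uri.query = URI.encode_www_form(url: url, output: "json")
  uri
end

puts index_uri("practicingruby.com/*")
```

The resulting URI object can be passed straight to `Net::HTTP.get`, or its `query` string can be appended to the path given to `conn.get`.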
answered Jul 10, 2015 at 7:12 by bogem
\$\endgroup\$