\$\begingroup\$
An HTTP request is made and a JSON string is returned, which needs to be parsed.
Example response:
    {"urlkey": "com,practicingruby)/", "timestamp": "20150420004437", "status": "200", "url": "https://practicingruby.com/", "filename": "common-crawl/crawl-data/CC-MAIN-2015-18/segments/1429246644200.21/warc/CC-MAIN-20150417045724-00242-ip-10-235-10-82.ec2.internal.warc.gz", "length": "9219", "mime": "text/html", "offset": "986953615", "digest": "DOGJXRGCHRUNDTKKJMLYW2UY2BSWCSHX"}
    {"urlkey": "com,practicingruby)/", "timestamp": "20150425001851", "status": "200", "url": "https://practicingruby.com/", "filename": "common-crawl/crawl-data/CC-MAIN-2015-18/segments/1429246645538.5/warc/CC-MAIN-20150417045725-00242-ip-10-235-10-82.ec2.internal.warc.gz", "length": "9218", "mime": "text/html", "offset": "935932558", "digest": "LJKP47MYZ2KEEAYWZ4HICSVIHDG7CARQ"}
    {"urlkey": "com,practicingruby)/articles/ant-colony-simulation?u=5c7a967f21", "timestamp": "20150421081357", "status": "200", "url": "https://practicingruby.com/articles/ant-colony-simulation?u=5c7a967f21", "filename": "common-crawl/crawl-data/CC-MAIN-2015-18/segments/1429246641054.14/warc/CC-MAIN-20150417045721-00029-ip-10-235-10-82.ec2.internal.warc.gz", "length": "10013", "mime": "text/html", "offset": "966385301", "digest": "AWIR7EJQJCGJYUBWCQBC5UFHCJ2ZNWPQ"}

My code:
    result = Net::HTTP.get(URI("http://index.commoncrawl.org/CC-MAIN-2015-18-index?url=#{url}&output=json")).split("}")
    result.each do |res|
      break if res == "\n"
      # need to add back braces because we used it to split the various json hashes from the http request
      res << "}"
      to_crawl = JSON.parse(res)
      puts to_crawl
    end

It works, but I'm sure there is a much better way to do it, or at least a better way to write the code.
\$\endgroup\$
2 Answers
This body.split("}") is doing you a disservice, as it destroys the structure of the response. The response already contains one JSON object per line, so split it by lines instead:
    body = Net::HTTP.get(...)
    data = body.lines.map { |line| JSON.parse(line) }
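
As a minimal sketch of the line-by-line approach, assuming the body holds one JSON object per line as described above: the example below uses a shortened, made-up sample body instead of a live request so it can run offline. The helper name parse_json_lines and the sample data are illustrative assumptions, not part of the original code.

```ruby
require 'json'

# Hypothetical sample body: one JSON object per line, shaped like the
# response in the question (fields shortened for illustration).
SAMPLE_BODY = <<~BODY
  {"urlkey": "com,practicingruby)/", "timestamp": "20150420004437", "status": "200"}
  {"urlkey": "com,practicingruby)/", "timestamp": "20150425001851", "status": "200"}

BODY

# Parse newline-delimited JSON into an array of hashes,
# skipping blank lines such as a trailing newline.
def parse_json_lines(body)
  body.each_line
      .map(&:strip)
      .reject(&:empty?)
      .map { |line| JSON.parse(line) }
end

records = parse_json_lines(SAMPLE_BODY)
records.each { |record| puts record["timestamp"] }
```

Skipping blank lines replaces the break if res == "\n" guard from the question, and no braces need to be re-added because each line is already a complete JSON object.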
Use Faraday:
    require 'faraday'
    require 'json'

    conn = Faraday.new("http://index.commoncrawl.org/") do |faraday|
      faraday.request :url_encoded             # form-encode POST params
      faraday.adapter Faraday.default_adapter  # make requests with Net::HTTP
    end

    response = conn.get("/CC-MAIN-2015-18-index?url=#{url}&output=json")
    # The body holds one JSON object per line, so parse each line separately
    parsed = response.body.lines.map { |line| JSON.parse(line) }

