
Why I made it

I’m building a voice assistant and thought it would be great if it could deliver news updates.

What it does

It cleans up the user’s query, determines whether it refers to a news category or a specific search term, and then calls the NewsAPI to fetch the most recent matching headline. After receiving the response, it extracts and tidies the article’s title, author, source, and description before returning a conversational summary that the text-to-speech module of my voice assistant (which I haven't posted yet) can say out loud with ease.

Code

Here is the Python code:

import requests
import json5


def NewsConcerning(text):
    """ Search for news concerning the text """
    text = (
        str(text)
        .replace("news", "")
        .replace("new", "")
        .replace("for ", "")
        .replace("concerning", "")
        .replace("can ", "")
        .replace("you ", "")
        .replace("give ", "")
        .replace("me ", "")
        .replace("the ", "")
        .replace("most ", "")
        .replace("recent ", "")
        .replace("in ", "")
        .replace("category ", "")
        .replace(" currently ", "")
        .replace("current ", "")
        .replace("now ", "")
        .replace("what ", "")
        .replace("whats ", "")
        .replace(" is ", "")
        .replace("about ", "")
        .replace(" ", "")
        .replace("today", "")
    )
    if "business" in text:
        text = "category=business"
    elif "entertainment" in text:
        text = "category=entertainment"
    elif "general" in text:
        text = "category=general"
    elif "health" in text:
        text = "category=health"
    elif "science" in text:
        text = "category=science"
    elif "sports" in text:
        text = "category=sports"
    elif "technology" in text:
        text = "category=technology"
    elif text == "":  # Eg input has been fully stripped and nothing is left
        text = "category=general"
    else:  # If specific text is queried:
        text = "q=" + text
    url = (
        "https://newsapi.org/v2/top-headlines?"
        + text
        + "&sortBy=publishedAt&pageSize=1&apiKey=THE_API_KEY_HAS_BEEN_REMOVED"
    )
    response = requests.get(url)
    parsed_data = json5.loads(response.text)
    #print(parsed_data)
    try:
        #print(text)
        if "articles" in parsed_data and parsed_data["articles"]:
            print("<ALL OK>")
        first_article = parsed_data["articles"][0]
        name = first_article["source"]["name"]
        author = first_article["author"]
        title = first_article["title"]
        desc = first_article["description"]
        title = (
            title.replace(name, "")
            .replace(" - ", " ")
            .replace("LISTEN | ", "")
            .replace(
                " | Opinion",
                ". I have also come to notice how this is an opinion-based text.",
            )
            .replace(" | ", " ")
            .replace("Watch Live: ", "live, ")
            .replace("!!!!!!", ", followed by six exclamation marks")
            .replace("!!!!!", ", followed by five exclamation marks")
            .replace("!!!!", ", followed by four exclamation marks")
            .replace("!!!", ", followed by three exclamation marks")
            .replace("!!", ", followed by two exclamation marks")
            .replace("\n", "")
        )
        if author:
            if author.lower() != name.lower():
                author = author.replace(",", "and")
                author = " by " + author
                title = title.replace(author, "")
            else:
                author = ""
        else:
            author = ", though I did not find who wrote it"
        if desc == "":
            desc = "Sadly, I could not find any description."
        else:
            content = desc.split("<p>")  # In case description comes in HTML format
            if len(content) > 1:
                desc = content[1]
            desc = desc.replace("&quot;", "").replace(".com", "")
        if len(name):
            name = "on " + name
        return f"\n\nHere is the recently published news I found {name}{author}:\nThe title to it is {title}... \n{desc}"
    except IndexError:
        text = text.replace("q=", "").replace("category=", "")
        return f"I could not find anything concerning what you asked for. Maybe I misheard, were you asking about {text}? Try again if not."
    except KeyError:
        return "Whoops! Looks like you have been very interested in the news today! I'm sorry to say that my daily request limit, one hundred per day or fifty per twelve hours, has been reached."


i = input("Enter what you would ask of the voice assistant (eg: news concerning bitcoin): ")
print(NewsConcerning(i))

Critique request

Please, tell me anything that comes to mind.

asked Nov 24 at 14:16 by Chip01

5 Answers


Request

Instead of:

url = (
    "https://newsapi.org/v2/top-headlines?"
    + text
    + "&sortBy=publishedAt&pageSize=1&apiKey=THE_API_KEY"
)
response = requests.get(url)

I find this more legible:

url = "https://newsapi.org/v2/top-headlines"
params = {
    "category": "general",
    "sortBy": "publishedAt",
    "pageSize": 1,
    "apiKey": "THE_API_KEY"
}
response = requests.get(url, params=params)

NB:

  • You can use requests.Session() instead of requests.get.
  • Always check the status code. It should normally be 200; otherwise, the rest of the program is going to fail or at the very least yield unexpected results.
  • An API key should not be hardcoded. Use a .env file instead.
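As a minimal sketch of that last point: read the key from the environment at runtime (the variable name NEWSAPI_KEY is my assumption; a loader such as python-dotenv can populate it from a .env file before this runs).

```python
import os


def get_api_key() -> str:
    """Fetch the NewsAPI key from the environment, never from source code."""
    # NEWSAPI_KEY is an assumed variable name; set it in your shell
    # or have python-dotenv load it from a .env file.
    key = os.environ.get("NEWSAPI_KEY")
    if not key:
        raise RuntimeError("Set the NEWSAPI_KEY environment variable first.")
    return key
```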

HTML parsing

This approach does not scale well:

desc = desc.replace("&quot;", "").replace(".com", "")

Who knows how many more HTML tags are going to show up over time. Use a lib like BeautifulSoup instead; lxml would also be sufficient.
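If you would rather avoid a third-party dependency, here is a sketch using the standard library's html.parser, which also decodes entities such as &quot; on its own:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text content of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # convert_charrefs=True (the default) has already turned
        # entities like &quot; into real characters here.
        self.parts.append(data)


def strip_html(desc: str) -> str:
    parser = TextExtractor()
    parser.feed(desc)
    return "".join(parser.parts).strip()
```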

If you are always expecting a JSON response, it would be appropriate to build a small function that retrieves content from a URL, based on one or more parameters you supply (notably the category). Then you can have another dedicated function for parsing the JSON response. The idea is: separation of responsibility. One function should do one thing and do it well.

Also, isolating concerns makes debugging easier. For example, if the JSON response is invalid, you can jump straight to the function that is responsible for retrieving the content.
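A sketch of that split (the function names and the returned fields are my own choices, not an established API):

```python
import requests

BASE_URL = "https://newsapi.org/v2/top-headlines"


def fetch_headlines(api_key, category="general", page_size=1):
    """Responsible only for the HTTP round trip."""
    params = {"category": category, "pageSize": page_size, "apiKey": api_key}
    response = requests.get(BASE_URL, params=params)
    response.raise_for_status()  # fail loudly on a non-2xx status
    return response.json()


def first_article(payload):
    """Responsible only for picking data out of the parsed JSON."""
    articles = payload.get("articles", [])
    if not articles:
        return None
    article = articles[0]
    return {
        "source": article.get("source", {}).get("name"),
        "title": article.get("title"),
        "description": article.get("description"),
    }
```

Because first_article is a pure function, it can be unit tested against a hand-written dict, with no network involved.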

Securing credentials

The API key will most likely show up in web server logs, as it is part of the URL. Whenever possible, transmit it differently, using headers for instance.

The documentation states that this method is supported:

You can attach your API key to a request in one of three ways:

Via the apiKey querystring parameter.
Via the X-Api-Key HTTP header.
Via the Authorization HTTP header. Including Bearer is optional, and be sure not to base 64 encode it like you may have seen in other authentication tutorials.

We strongly recommend either of the last two so that your API key isn't visible to others in logs or via request sniffing.

(emphasis is mine)
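A sketch of the header variant with requests; preparing the request lets you verify offline that the key never ends up in the URL:

```python
import requests


def build_request(api_key, category="general"):
    """Prepare a request that carries the key in the X-Api-Key header
    rather than in the (often logged) query string."""
    return requests.Request(
        "GET",
        "https://newsapi.org/v2/top-headlines",
        headers={"X-Api-Key": api_key},
        params={"category": category, "pageSize": 1},
    ).prepare()
```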

answered Nov 24 at 19:04 by Kate

Naming

NewsConcerning is styled in the way PEP 8 would tell us to style a class name. As a function it should be in snake case: news_concerning.

String replacements

Many of the strings you replace with the empty string are problematic. Replacing "in " with "", for instance, would change "vein " to "ve".

To address this you can utilize word boundaries (\b) in a regular expression.

DRY

You call str.replace many times. Instead, create a list of strings, join them with | and turn them into a regular expression that replaces any of them with "".

Taking these into consideration:

import re


def NewsConcerning(text):
    """ Search for news concerning the text """
    substrings = [
        "news", "new", "for ", "concerning", "can ",
        "you ", "give ", "me ", "the ", "most ", "recent ",
        "in ", "category ", " currently ", "current ",
        "now ", "what ", "whats ", " is ", "about ",
        " ", "today"
    ]
    pat = re.compile("|".join("\\b" + re.escape(s) + "\\b" for s in substrings))
    text = pat.sub("", str(text))

This also removes the possibility for one call tostr.replace to change the string so that a subsequent call will be triggered that otherwise wouldn't.

More repetition

Your conditional has more repetition.

    categories = (
        "business", "entertainment", "general", "health",
        "science", "sports", "technology"
    )
    cat_in_text = next((cat for cat in categories if cat in text), None)
    if cat_in_text:
        text = f"category={cat_in_text}"
    elif text == "":  # Eg input has been fully stripped and nothing is left
        text = "category=general"
    else:  # If specific text is queried:
        text = "q=" + text

URLs

But there's a problem with this approach. You should next be escaping this text variable when you insert it into your URL, but you can't without also escaping the = you inserted intentionally.
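One way out is to keep the key and the value apart until the last moment and let urllib.parse.urlencode do the escaping (a stdlib sketch; requests' params argument does the same job for you):

```python
from urllib.parse import urlencode

# The user's text is escaped as a value, so characters like & or +
# cannot mangle the query string.
query = urlencode({"q": "C++ & Rust news"})
url = "https://newsapi.org/v2/top-headlines?" + query
```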

__main__ guard

The final bit of code should be wrapped in a __main__ guard.

if __name__ == "__main__":
    i = input("Enter what you would ask of the voice assistant (eg: news concerning bitcoin): ")
    print(NewsConcerning(i))

Case sensitivity

As it stands, your program doesn't work very well if the user inputs mixed-case words. The fix is easy: convert the input to lowercase.

You can either convert the input immediately to lowercase, or let the NewsConcerning function do it.

if __name__ == "__main__":
    i = input("Enter what you would ask of the voice assistant (eg: news concerning bitcoin): ")
    print(NewsConcerning(i.lower()))

Exception handling

Your try is miles from your actual exception handlers. The code in your try block should be as minimal as possible.
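As a sketch of how to shrink it, wrap only the single subscript operation that can actually raise, and keep all the formatting outside:

```python
def pick_first_article(parsed_data):
    """Return the first article, or None if the payload has no articles."""
    try:
        return parsed_data["articles"][0]
    except (KeyError, IndexError):
        return None
```

Everything that formats the article then runs outside the try, so a bug there surfaces as a real traceback instead of being swallowed by a broad handler.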

Boolean tests and false-y values

In a couple of places you do boolean tests on the length of lists or on whether strings are empty. You should be aware that empty strings are false in Python, as are empty lists.
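For example, the checks on desc and name can lean on truthiness directly:

```python
desc = ""
if not desc:  # an empty string is falsy
    desc = "Sadly, I could not find any description."

name = "BBC News"
if name:  # a non-empty string is truthy; no need for len(name)
    name = "on " + name
```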

answered Nov 24 at 14:51 by Chris

String Replacements and Matching

You should use a case-insensitive regular expression, using the re.I flag, to do your matching. For example:

import re

...

matching_words = (
    'business',
    'entertainment',
    ...
    'general'
)
matching_rex = re.compile(fr'\b({"|".join(matching_words)})\b', flags=re.I)
m = matching_rex.search(text)
if m:
    ...

If any of the words in matching_words contain characters that have special meaning in regular expressions (e.g. '.', '|', '(', etc.), then each word should be escaped, either manually in matching_words or, preferably, by calling re.escape on each word to be matched.

A case-insensitive regular expression can also be used with re.sub to do your replacements of the input text. For the sake of efficiency, you should only do the replacement logic if you are unable to find a keyword match.
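A sketch of that replacement step (the filler-word list here is illustrative, not the complete one from the question):

```python
import re

filler = ("news", "concerning", "about", "the", "whats")
filler_rex = re.compile(
    r"\b(" + "|".join(map(re.escape, filler)) + r")\b", flags=re.I
)


def clean_query(text):
    """Remove filler words case-insensitively, then collapse leftover spaces."""
    return re.sub(r"\s+", " ", filler_rex.sub("", text)).strip()
```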

Docstrings and Comments

These would be helpful.

answered Nov 24 at 15:16 by Booboo

single responsibility

The SRP says that a function should do one thing well.

The expression that replaces "concerning" and other stop words should be in a helper function. For one thing, that would make it very easy to unit test.

consistency

        .replace(" currently ", "")
        .replace("current ", "")
        .replace("now ", "")
        .replace("what ", "")
        .replace("whats ", "")
        .replace(" is ", "")

It's a bit worrying that "new" has no space characters, "for " ends with one, and a few words are surrounded by them. I would have a hard time writing a docstring that explains just what cleaned-up result we're computing here. That is to say, there doesn't appear to be a proper specification for the cleanup behavior.

Also, if the input text is on the long side, then repeatedly scanning it for two dozen words scales poorly. Given that we already have a bunch of stop words, I would anticipate that in a few months we might accumulate, say, twice as many. That's just how code maintenance tends to go.

Consider putting a stake in the ground: "We parse text as English words, and remove stopwords." Then you might start to implement that spec with import nltk. That package offers nltk.word_tokenize(), which is robust and convenient.

Or you might go a bit further, such as saying you will strip punctuation, as you would apparently want a "what's" \$\rightarrow\$ "whats" conversion. You might even lemmatize, so a common word stem would match e.g. both "current" and "currently".

Consider putting your stop words in a set(), so that when you loop over for word in words: you can probe set membership in \$O(1)\$ constant time, rather than consuming time proportional to the number of stopwords.
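A minimal sketch of that idea (the stop-word list is illustrative):

```python
# Membership in a set is an O(1) probe per word.
STOPWORDS = {"news", "concerning", "whats", "the", "about", "for", "me"}


def remove_stopwords(text):
    """Keep only the words that are not in the stop-word set."""
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)
```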

break out a helper

    if "business" in text:
        text = "category=business"
    ...
    elif "technology" in text:
        text = "category=technology"

This should be a for loop over the enumerated categories, with an assignment of text = f"category={category}", rather than the copy-pasta we see here. Any time you're tempted to start leaning on that CMD-V paste key, take a breath, step back, and think about what the appropriate function signature should be, so you can parameterize instead of pasting and then editing.
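A sketch of the loop version (categorize is a name of my choosing):

```python
CATEGORIES = ("business", "entertainment", "general", "health",
              "science", "sports", "technology")


def categorize(text):
    """Map the cleaned query to a NewsAPI query-string fragment."""
    for category in CATEGORIES:
        if category in text:
            return f"category={category}"
    # Fall back to a free-text search, or to general news if nothing is left.
    return "q=" + text if text else "category=general"
```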

Also, don't put a secret API key in source code checked into git, and certainly don't publish it to the world. Prefer to obtain it from an env var with os.environ["API_KEY"].

comments

The "followed by six exclamation marks" thing is weird enough that it deserves either a unit test which exercises that case, or a # comment or helper """docstring""" that explains the business use case. Apparently there's some site that is fond of exclamation marks, but it's unclear how that interferes with the proper functioning of NewsConcerning(). Which brings us back to: write an explicit spec for the function, please, preferably as a docstring. Then some months from now a maintainer can decide whether it is still delivering correct results, even as news content and formatting continue to change.

Oh, and PEP 8 asks that you spell it def news_concerning().

check for error at the site of the error

    response = requests.get(url)
    ...
    except KeyError:
        return "Whoops! ..."

That diagnostic message is very nice and helpful, thank you. But it's very odd to defer the check for so many dozens of lines down in the source code. We should not have to juggle "are we in error state here? / normal state?" while reading so much code that manipulates the abnormal text result.

The usual idiom is

    response = requests.get(url)
    response.raise_for_status()

on the assumption that a 400- or 500-series error won't let us accomplish anything useful with response.text. Or you might examine response.status_code yourself.

answered Nov 24 at 18:31 by J_H

Just a high-level comment: a large part of the code consists of rule-based parsing of text (all the string-replace calls). This is extremely brittle (there are lots of cases that you won't cover) and error prone. Most importantly, it's not maintainable. It's perhaps good as a first stab, to get an idea of what you want to do, or to build a rough first prototype, but it's practically impossible to build this out into a bigger program. I would advise spending time learning to use a language model (see the tutorials on the Hugging Face site), and either fine-tuning it or constraining it.

answered Nov 25 at 2:05 by mudskipper
  • I disagree. Rule-based parsing is significantly easier to understand and maintain than a language model, provided that it's well-structured – which, as other answers have pointed out, it isn't. If you want to do more sophisticated NLP, I'd recommend a package like spacy, rather than jumping straight to the solution-looking-for-a-problem of an LLM. – Commented Nov 25 at 20:44
  • I don't disagree that rule-based parsing is easier to understand. But it's totally unsuitable for handling free-flowing natural language text. The proof is in the pudding; the dev who wrote that code will notice this soon enough. (But spacy is a very nice library to get familiar with – also agree with that.) – Commented Nov 25 at 21:28
  • I also agree, btw, that it is good to first get some familiarity with rule-based parsing before diving into LLMs. – Commented Nov 25 at 21:30
  • Actually, rule-based parsing works quite well for voice interfaces: users tend to use a restricted grammar when interacting with them, and by careful observation this can be turned into a system of rules. I'd recommend re-running the studies every few years, to keep up with linguistic drift, but you can get a lot of mileage out of the simple solutions even here. – Commented Nov 25 at 21:37
