Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Initialize a search engine with tweets from csv. Use logic expression like AND OR NOT to form a query and search over it. Non-query expression is also supported.

NotificationsYou must be signed in to change notification settings

NBsyxx/tweet-query-search-python3

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 

Repository files navigation

How to run it?

Just as the default, init the search engine by process tweetsFor single search use

TweetIndex.search("your | ( query & here )")

ti=TweetIndex()ti.process_tweets(list_of_tweets)ti.search("neeva hello")

query syntax

Where we use syntax?: it is the string that user put in search function of TweetIndex class, a most basic structure is as below

ti=TweetIndex()ti.process_tweets(list_of_tweets)ti.search("query")# we can go with a queryti.search("this & ( ( !him & know ) | ( very & because ) )")

Approach

Within Preprocessing:

Create 2 hash:

word -> {times} set:we can use word to find the times(time is assumed unique, so it indexes the tweet text)

time -> tweet str: we can use the time to index tweet text

Search Process

Since the infix query expression includes "(",")", it helps to use postfix: we first convert query to postfix

defparse_infix_to_postfix(self,infix):stack= []# initialization of empty stackpostfix= []forcharacterininfix:ifcharacternotinself.Operators:# if an operand append in postfix infixpostfix.append(self.word_to_time[character])elifcharacter=='(':# else Operators push onto stackstack.append('(')elifcharacter==')':whilestackandstack[-1]!='(':postfix.append(stack.pop())stack.pop()else:whilestackandstack[-1]!='('andself.Priority[character]<=self.Priority[stack[-1]]:postfix.append(stack.pop())stack.append(character)whilestack:postfix.append(stack.pop())returnpostfix

Then we evaluate the postfix expression

defeval_postfix(self,postfix):stack= []forcurinpostfix:ifisinstance(cur,set):stack.append(cur)ifcurinself.Operators:top1=stack.pop()top2=stack.pop()stack.append(self.eval(top1,top2,cur))returnlist(stack[0])

Sort the output based on its value(time)and return all candidates

The search workflow for query is as follow:

defsearch(self,query:str)->List[Tuple[str,int]]:list_of_words=query.split(" ")list_of_words=self.parse_infix_to_postfix(list_of_words)list_of_times=self.eval_postfix(list_of_words)list_of_times.sort(reverse=True)return [(self.list_of_tweets[t],t)fortinlist_of_times[0:5]]

Design decisions, tradeoffs, assumptions

I decided to use memory for speed, therefore a lot of dictionaries in the code.The whole application is built over the assumptions that time is unique to tweet.User can input any string for the search, but everything will be processed according to rules.

Complexity analysis

Time complexityWe have N tweets, average tweet length L in words, and the query length Q, Average occurance for each word in all tweets C

The process_tweets() will take O(NL) for preprocessingThe query will take O(QC) for searching

Spatial complexityThe preprocessing will take O(NLC) memory spaceSearch takes negligible space.

For the starter code, time complexity of preprocessing is O(NL) and it will take O(NLQ) for searching (very slow), it only takes O(NL) spaces

code updated tests, and time bench marks

query syntax

Where is syntax: it is the string that user put in search function of TweetIndex class, a most basic structure is as below

ti=TweetIndex()ti.process_tweets(list_of_tweets)ti.search("query")

TYPE1 non-queries: words without "&" and "|" will be parsed as non-queries, and we return the intersaction of each words, we allow the use of "!" to represent sentences not having this words

TYPE2 queries: words can be with "&" and "|", but it do require space between each word. "&" means intersaction and "|" means union, we allow the use of "!" to represent sentences not having this words. The default order for processing is prefix which is from left to right, we can adjust the order by adding "("")"

code showcases and tests

** testing non-queries **print(ti.search("neeva this him"))return: [('that we him this neeva know', 9102), ('people him this neeva for', 8383), ('what by of special him this neeva a', 8128), ('neeva their him this see thing they', 7644), ('can special people him this neeva be who in', 5690)]print(ti.search("neeva this him !know"))return:  [('people him this neeva for', 8383), ('what by of special him this neeva a', 8128), ('neeva their him this see thing they', 7644),('can special people him this neeva be who in', 5690), ('take of special him this neeva', 4297)]print(ti.search("brother"))return:[] ** testing queries **print(ti.search("neeva & this & ( ( !him & know ) | ( very & because ) )"))return:  [('because very this neeva know', 9969), ('special it this neeva know could', 9314), ('that because very when than this neeva', 7258),('time year special like those this neeva she for know', 7097), ('time it when this neeva know a', 6805)]** testing mixing expression **print(ti.search("neeva | this & him"))return:  since it is postfix parsing so it goes from left to right   ** testing random things  **print(ti.search("neeva | & him"))stdout: Query Syntax Errorreturn: [('', -1)]print(ti.search("| & "))stdout: Query Syntax Errorreturn:  [('', -1)]print(ti.search("neeva&hello"))it will read it and parse it as a query sentence since there's no "neeva&hello"return: []

on 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHzin terms of bench marks

for starter versionit takes 0.002s to preprocess 10000 tweet

it takes 3.05s to run 1000 times of

ti.search("neeva hello")

for my versionit takes 0.03s to preprocess 10000 tweet

it takes 0.14s to run 1000 times of

ti.search("neeva hello")

it takes 0.21s to run 1000 times of

ti.search("neeva & this & ( ( !him & know ) | ( very & because ) )")

About

Initialize a search engine with tweets from csv. Use logic expression like AND OR NOT to form a query and search over it. Non-query expression is also supported.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages


[8]ページ先頭

©2009-2025 Movatter.jp