ProjectWorst/Excel-Documents-CorrelationPublic

NotificationsYou must be signed in to change notification settings
Fork0
Star2

Instead of making comparisons between two Excel documents reading line-by-line, this program utilizes data manipulation techniques to make correlations between Excel documents and creates a new Excel document based on these correlations.

License

MIT license

2 stars 0 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
Document Correlation.py		Document Correlation.py
LICENSE		LICENSE
README.md		README.md

Repository files navigation

Excel Documents Correlation Program

I created this program to determine correlations from two different Excel documents. The data is then exported as a single Excel document for analysis. It utilizes various libraries for data manipulation, natural language processing, and machine learning.

Prerequisites

Before running the program, ensure that you have the following libraries installed:

pandas: for data manipulation
nltk: the Natural Language Toolkit library used for natural language processing tasks
sklearn: scikit-learn library, which provides various tools for machine learning and vectorization

You can install these libraries using pip:

pip install pandas nltk scikit-learn

In addition, the program requires downloading necessary resources from the 'nltk' library, specifically the 'stopwords' corpus and the 'punkt' tokenizer. To download these resources, run the following commands:

importnltknltk.download('stopwords')nltk.download('punkt')

Usage

Import the required libraries:

importpandasaspdimportnltkfromnltk.corpusimportstopwordsfromnltk.tokenizeimportword_tokenizeimportstringfromsklearn.feature_extraction.textimportTfidfVectorizerfromsklearn.metrics.pairwiseimportcosine_similarity

Load data from two Excel files into separate dataframes (df1 anddf2):

df1=pd.read_excel("path to file")df2=pd.read_excel("path to file")

Define thepreprocess_text function to preprocess the text data:

defpreprocess_text(text):# Remove punctuationtext=text.translate(str.maketrans('.',',',string.punctuation))# Convert to lowercase.text=text.lower()# Tokenize text into words.tokens=word_tokenize(text)# Remove stop words.stop_words=set(stopwords.words('english'))tokens= [wordforwordintokensifwordnotinstop_words]# Join the words back into a single string.preprocessed_text=' '.join(tokens)returnpreprocessed_text

Apply thepreprocess_text function to specific columns of bothdf1 anddf2 dataframes:

df1['column1']=df1['column1'].apply(preprocess_text)df1['column2']=df1['column2'].apply(preprocess_text)df1['column3']=df1['column3'].apply(preprocess_text)df1['column4']=df1['column4'].apply(preprocess_text)df1['column5']=df1['column5'].apply(preprocess_text)# Same concept is applied now to df2df2['column1']=df2['column1'].apply(preprocess_text)df2['column2']=df2['column2'].apply(preprocess_text)df2['column3']=df2['column3'].apply(preprocess_text)df2['column4']=df2['column4'].apply(preprocess_text)df2['column5']=df2['column5'].apply(preprocess_text)

Initialize a TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer:

tfidf_vectorizer=TfidfVectorizer()

Fit the vectorizer on the preprocessed text data fromdf1 and transform the text data fromdf2:

tfidf_matrix1=tfidf_vectorizer.fit_transform(df1['column1']+' '+df1['column2']+' '+df1['column3']+' '+df1['column4']+' '+df1['column5'])tfidf_matrix2=tfidf_vectorizer.transform(df2['column1']+' '+df2['column2']+' '+df2['column3']+' '+df2['column4']+' '+df2['column5'])

Calculate the cosine similarity between each pair of requirements:

similarity_matrix=cosine_similarity(tfidf_matrix1,tfidf_matrix2)

Set a similarity threshold value to determine correlated requirements:

similarity_threshold=0.35

Find pairs of correlated requirements and store them in thecorrelated_requirements list:

correlated_requirements= []foriinrange(similarity_matrix.shape[0]):max_similarity_idx=similarity_matrix[i].argmax()max_similarity=similarity_matrix[i,max_similarity_idx]ifmax_similarity>=similarity_threshold:correlated_requirements.append((i,max_similarity_idx))

Identify uncorrelated rows indf2 and store their indices in theuncorrelated_rows list:

uncorrelated_rows= []forrow_idxinrange(len(df2)):ifrow_idxnotin [req2_idxfor_,req2_idxincorrelated_requirements]:uncorrelated_rows.append(row_idx)

Generate a report by retrieving the details of correlated requirements:

report= []forreq1_idx,req2_idxincorrelated_requirements:req1_details=df1.iloc[req1_idx]req2_details=df2.iloc[req2_idx]report.append((req1_details,req2_details))

Export the report to an Excel file:

FY_column_list=list(df1.columns[1:])+list(df2.columns[1:])report_data= []forreq1_details,req2_detailsinreport:req1_column1=req1_details['column1']req2_column1=req2_details['column1']req1_column2=req1_details['column2']req2_column2=req2_details['column2']req1_column3=req1_details['column3']req2_column3=req2_details['column3']req1_impactifnotfunded=req1_details['column4']req2_impactifnotfunded=req2_details['column4']req1_costcenter=req1_details['column5']req2_costcenter=req2_details['column5']report_data.append(["Correlated"]+req1_details[1:]+req2_details[1:])forrow_idxinuncorrelated_rows:req_details=df2.iloc[row_idx].tolist()empty_values= [""]*len(df1.columns[1:])report_data.append(["Uncorrelated"]+empty_values+req_details[1:])FY_column_list_with_flag= ["Correlation"]+FY_column_listreport_df=pd.DataFrame(report_data,columns=FY_column_list_with_flag)report_df.to_excel("path to file",index=False)

Make sure to replace the placeholder "path to file" with the actual file paths where you want to read the input Excel files and export the report.

Customization

You can customize the program by adjusting the following parameters:

File paths: Replace "path to file" with the actual paths to the input Excel files and the desired output report file.
Similarity threshold: Modify thesimilarity_threshold value to control the correlation threshold for requirements.
Columns: Update the column names used in the program (column1,

column2,column3,column4,column5) to match the column names in your Excel files.

Feel free to explore and adapt the code according to your specific requirements.

About

Releases

No releases published

Packages

No packages published

Languages

Python100.0%

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

License

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Excel Documents Correlation Program

Prerequisites

Usage

Customization

About

Topics

Resources

License

Code of conduct

Uh oh!

Stars

Watchers

Forks

Releases

Packages

Languages

Movatterモバイル変換

License

ProjectWorst/Excel-Documents-Correlation

Folders and files

Latest commit

History

Repository files navigation

Excel Documents Correlation Program

Prerequisites

Usage

Customization

About

Topics

Resources

License

Code of conduct

Uh oh!

Stars

Watchers

Forks

Releases

Packages0

Languages

Packages