Practical Data Analysis Using Jupyter Notebook

Fundamentals of Data Analysis

Welcome, and thank you for reading my book. I'm excited to share my passion for data, and I hope to provide the resources and insights to fast-track your journey into data analysis. My goal is to educate, mentor, and coach you throughout this book on the techniques used to become a top-notch data analyst. During this process, you will get hands-on experience using the latest open source technologies available, such as Jupyter Notebook and Python. We will stay within that technology ecosystem throughout this book to avoid confusion. However, you can be confident that the concepts and skills learned are transferable across open source and vendor solutions with a focus on all things data.

In this chapter, we will cover the following:

  • The evolution of data analysis and why it is important
  • What makes a good data analyst?
  • Understanding data types and why they are important
  • Data classifications and data attributes explained
  • Understanding data literacy

The evolution of data analysis and why it is important

To begin, we should define what data is. You will find varying definitions, but I would define data as the digital persistence of facts, knowledge, and information consolidated for reference or analysis. The focus of my definition should be the word persistence, because digital facts remain even after the computers used to create them are powered down, and they are retrievable for future use. Rather than focus on the formal definition, let's discuss the world of data and how it impacts our daily lives. Whether you are reading a review to decide which product to buy or viewing the price of a stock, consuming information has become significantly easier, allowing you to make informed, data-driven decisions.

Data has become entangled with products and services across every industry, from farming to smartphones. For example, America's Grow-a-Row, a New Jersey farm-to-food-bank charity, donates over 1.5 million pounds of fresh produce each year to feed people in need throughout the region, according to their annual report. America's Grow-a-Row has thousands of volunteers and uses data to maximize production yields during the harvest season.

As the demand for being a consumer of data has increased, so has the supply side, which is characterized as the producer of data. Producing data has increased in scale as technology innovations have evolved. I'll discuss this in more detail shortly, but this large-scale consumption and production can be summarized as big data. A National Institute of Standards and Technology report defined big data as consisting of extensive datasets—primarily in the characteristics of volume, velocity, and/or variability—that require a scalable architecture for efficient storage, manipulation, and analysis.

This explosion of big data is characterized by the 3Vs (Volume, Velocity, and Variety), which have become a widely accepted concept among data professionals:

  • Volume is based on the quantity of data that is stored in any format, such as image files, movies, and database transactions, which are measured in gigabytes, terabytes, or even zettabytes. To give context, you can store hundreds of thousands of songs or pictures on one terabyte of storage space. Even more amazing than the figures is how little it costs you. Google Drive, for example, offers up to 5 TB (terabytes) of storage for free, according to their support site.
  • Velocity is the speed at which data is generated. This covers both how data is produced and how it is consumed. For example, batch processing is how data feeds are sent between systems, where blocks of records or bundles of files are sent and received. Modern approaches are real-time streams of data, where the data flow is in a constant state of movement.
  • Variety is all of the different formats that data can be stored in, including text, image, database tables, and files. This variety has created both challenges and opportunities for analysis because of the different technologies and techniques required to work with the data.

Understanding the 3Vs is important for data analysis because you must become good at being both a consumer and a producer of data. The simple questions of how your data is stored, when a file was produced, where a database table is located, and in what format you should store the output of your analysis can all be addressed by understanding the 3Vs.

There is some debate, with which I disagree, that the 3Vs should increase to include Value, Visualization, and Veracity. No worries; we will cover these concepts throughout this book.

This leads us to a formal definition of data analysis, which is defined as a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making, as stated in Review of business intelligence through data analysis.

Xia, B. S., & Gong, P. (2015). Review of business intelligence through data analysis. Benchmarking, 21(2), 300-311. doi:10.1108/BIJ-08-2012-0050

What I like about this definition is its focus on solving problems using data without a focus on which technologies are used. To make this possible, there have been significant technological milestones, the introduction of new concepts, and people who have broken down the barriers.

To showcase the evolution of data analysis, I compiled a few tables of key events from 1945 to 2018 that I feel are the most influential. The following table comprises events ranging from innovators such as Dr. E.F. Codd, who created the relational database model, to the launch of the iPhone, which spawned the mobile analytics industry.

The data for the following diagram was collected from multiple sources, centralized in one place as a table of columns and rows, and then visualized using a dendrogram chart. I posted the CSV file in the GitHub repository for reference: https://github.com/PacktPublishing/python-data-analysis-beginners-guide. Organizing the information and conforming the data in one place made the data visualization easier to produce and enables further analysis:

That process of collecting, formatting, and storing data in this readable format demonstrates the first step of becoming a producer of data. To make this information easier to consume, I summarize these events by decade in the following table:

Decade | Count of Milestones
1940s  | 2
1950s  | 2
1960s  | 1
1970s  | 2
1980s  | 5
1990s  | 9
2000s  | 14
2010s  | 7

From the preceding summary table, you can see that the majority of these milestone events occurred in the 1990s and 2000s. What is insightful about this analysis is that recent innovations have removed the barriers of entry for individuals to work with data. Before the 1990s, the high purchasing costs of hardware and software restricted the field of data analysis to a relatively limited number of careers. The costs associated with access to the underlying data were also great, and working with it typically required higher education and a specialized career in software programming or as an actuary.

A visual way to look at this same data would be a trend bar chart, as shown in the following diagram. In this example, the height of the bars represents the same information as in the preceding table, and the Count of Milestones is on the left, or the y axis. What is nice about this visual representation is that it is a faster way for the consumer to see the upward pattern of where most events occur without scanning through the results found in the preceding diagram or table:
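If you would like to reproduce a chart like this yourself, the following is a minimal matplotlib sketch that plots the decade counts from the preceding summary table; the labels and styling are my own choices, not the book's exact figure.

import matplotlib.pyplot as plt

# Decade counts taken from the preceding summary table
decades = ["1940s", "1950s", "1960s", "1970s", "1980s", "1990s", "2000s", "2010s"]
milestones = [2, 2, 1, 2, 5, 9, 14, 7]

plt.bar(decades, milestones)
plt.xlabel("Decade")
plt.ylabel("Count of Milestones")
plt.title("Data analysis milestones by decade")
plt.show()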

The evolution of data analysis is important to understand because now you know some of the pioneers who opened doors for opportunities and careers working with data, along with the key technology breakthroughs that significantly reduced the time to make decisions with data, both as consumers and as producers.

What makes a good data analyst?

I will now break down the contributing factors that make up a good data analyst. From my experience, a good data analyst must be eager to learn and continue to ask questions throughout the process of working with data. The focus of those questions will vary based on the audience consuming the results. To be an expert in the field of data analysis, excellent communication skills are required so you can translate raw data into insights that can impact change in a positive way. To make this easier to remember, use the following acronyms to help improve your data analyst skills.

Know Your Data (KYD)

Knowing your data is all about understanding the source technology that was used to create the data, along with the business requirements and rules used to store it. Do research ahead of time to understand what the business is all about and how the data is used. For example, if you are working with a sales team, learn what drives their team's success. Do they have daily, monthly, or quarterly sales quotas? Do they do reporting for month-end/quarter-end that goes to senior management and has to be accurate because it has financial impacts on the company? Learning more about the source data by asking questions about how it will be consumed will help focus your analysis when you have to deliver results.


KYD is also about data lineage, which is understanding how the data was originally sourced, including the technologies used along with the transformations that occurred before, during, and afterward. Refer back to the 3Vs so you can effectively communicate responses to common questions about the data, such as where this data is sourced from or who is responsible for maintaining the data source.

Voice of the Customer (VOC)

The concept of VOC is nothing new and has been taught at universities for years as a well-known concept applied in sales, marketing, and many other business operations. VOC is the concept of understanding customer needs by learning from or listening to customers before, during, and after they use a company's product or service. The relevance of this concept remains important today, and it should be applied to every data project that you participate in. This process is where you should interview the consumers of the data analysis results before even looking at the data. If you are working with business users, listen to what their needs are by writing down the specific business questions they are trying to answer.

Schedule a working session with them where you can engage in a dialog. Make sure you focus on their current pain points, such as the time it takes to curate all of the data used to make decisions. Does it take three days to complete the process every month? If you can deliver an automated data product or a dashboard that reduces that time to a few mouse clicks, your data analysis skills will make you look like a hero to your business users.

During a tech talk at a local university, I was asked the difference between KYD and VOC. I explained that both are important and focused on communicating and learning more about the subject area or business. The key difference is being prepared versus being present. KYD is all about doing your homework ahead of time to be prepared before talking to experts. VOC is all about listening to the needs of your business or consumers regarding the data.

Always Be Agile (ABA)

The agile methodology has become commonplace in the industry for the application, web, and mobile development Software Development Life Cycle (SDLC). One of the reasons the agile project management process is successful is that it creates an interactive communication line between the business and technical teams to iteratively deliver business value through the use of data and usable features.

The agile process involves creating stories with a common theme, where a development team completes tasks in 2-3 week sprints. In that process, it is important to understand the what and the why for each story, including the business value or the problem you are trying to solve.

The agile approach has ceremonies where the developers and business sponsors come together to capture requirements and then deliver incremental value. That improvement in value could be anything from a new dataset available for access to a new feature added to an app.

See the following diagram for a visual representation of these concepts. Notice how these concepts are not linear and require multiple iterations, which helps to improve the communication between all people involved in the data analysis before, during, and after delivery of results:

Finally, I believe the most important trait of a good data analyst is a passion for working with data. If your passion can be fueled by continuously learning about all things data, it becomes a lifelong and fulfilling journey.

Understanding data types and their significance

As we have uncovered with the 3Vs, data comes in all shapes and sizes, so let's break down some key data types and better understand why they are important. To begin, let's classify data in the general terms of unstructured, semi-structured, and structured.

Unstructured data

The concept behind unstructured data, which is textual in nature, has been around since the 1990s and includes the following examples: the body of an email message, tweets, books, health records, and images. A simple example of unstructured data would be an email message body, which is classified as free text. Free text may have some obvious structure that a human can identify, such as free space to break up paragraphs, dates, and phone numbers, but having a computer identify those elements would require programming to classify them as such. What makes free text challenging for data analysis is its inconsistent nature, especially when trying to work with multiple examples.
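To make this concrete, here is a minimal sketch of that kind of programming; the email body and the regular expressions are my own illustrative assumptions, showing how elements a human spots instantly must be described explicitly for a computer.

import re

# A made-up email body: the structure is obvious to a human reader
email_body = """Hi team,
Please call me at 212-555-1212 before 08/19/2020
to review the quarterly numbers."""

# Each element must be described explicitly as a pattern
phone_numbers = re.findall(r"\d{3}-\d{3}-\d{4}", email_body)
dates = re.findall(r"\d{2}/\d{2}/\d{4}", email_body)

print(phone_numbers)  # ['212-555-1212']
print(dates)          # ['08/19/2020']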

When working with unstructured data, there will be inconsistencies because of the nature of free text, including misspellings, inconsistent date formats, and so on. Always have a peer review the workflow or code used to curate the data.

Semi-structured data

Next, we have semi-structured data, which is similar to unstructured data; however, the key difference is the addition of tags, which are keywords or any classification used to create a natural hierarchy. Examples of semi-structured data are XML and JSON files, as shown in the following code:

{
    "First_Name": "John",
    "Last_Name": "Doe",
    "Age": 42,
    "Home_Address": {
        "Address_1": "123 Main Street",
        "Address_2": [],
        "City": "New York",
        "State": "NY",
        "Zip_Code": "10021"
    },
    "Phone_Number": [
        {
            "Type": "cell",
            "Number": "212-555-1212"
        },
        {
            "Type": "home",
            "Number": "212 555-4567"
        }
    ],
    "Children": [],
    "Spouse": "yes"
}

This JSON-formatted code allows for free text elements such as a street address, a phone number, and age, but now has tags created to identify those fields and values, a concept called key-value pairs. This key-value pair concept allows for the classification of data with a structure for analysis, such as filtering, but still has the flexibility to change the elements as necessary to support the unstructured/free text. The biggest advantage of semi-structured data is the flexibility to change the underlying schema of how the data is stored. The schema is a foundational concept of traditional database systems that defines how the data must be persisted (that is, stored on disk).
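As a minimal sketch of how those key-value pairs support analysis, the following parses an abridged version of the record above with Python's json module; the filtering logic is my own example, not a prescribed approach.

import json

# An abridged version of the semi-structured record shown above
record = json.loads("""{
    "First_Name": "John",
    "Age": 42,
    "Home_Address": {"City": "New York", "State": "NY"},
    "Phone_Number": [
        {"Type": "cell", "Number": "212-555-1212"},
        {"Type": "home", "Number": "212 555-4567"}
    ]
}""")

# Tags (keys) give the free text enough structure to filter on
print(record["Home_Address"]["City"])  # New York
cell_numbers = [p["Number"] for p in record["Phone_Number"] if p["Type"] == "cell"]
print(cell_numbers)  # ['212-555-1212']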

The disadvantage of semi-structured data is that you may still find inconsistencies in the data values depending on how the data was captured. Ideally, the burden of consistency is moved to the User Interface (UI), which would have coded standards and business rules, such as required fields, to increase the quality; but as a data analyst who practices KYD, you should validate that during the project.

Structured data

Finally, we have structured data, which is the most common type found in databases and in data created by applications (apps or software) and code. The biggest benefit of structured data is consistency and relatively high quality between records, especially when they are stored in the same database table. The conformity of data and structure is the foundation for analysis, which allows both the producers and consumers of structured data to come to the same results. The topic of databases, or Database Management Systems (DBMS) and Relational Database Management Systems (RDBMS), is vast and will not be covered here, but having some understanding will help you to become a better data analyst.

The following diagram is a basic Entity-Relationship (ER) diagram of three tables that would be found in a database:

In this example, each entity would represent a physical table stored in the database, named car, part, and car_part_bridge. The relationship between car and part is defined by the table called car_part_bridge, which can be classified by multiple names, such as a bridge, junction, mapping, or link table. The name of each field in a table is on the left, such as part_id, name, or description found in the part table.

The pk label next to the car_id and part_id field names helps to identify the primary key for each table. This allows one field to uniquely identify each record found in the table. If a primary key from one table exists in another table, it is called a foreign key there, which is the foundation of how the relationship between the tables is defined and ultimately how they are joined together.
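To illustrate how a bridge table resolves that relationship, here is a minimal pandas sketch; the sample rows are invented, and only the column names follow the diagram.

import pandas as pd

# Hypothetical rows for the three tables in the ER diagram
car = pd.DataFrame({"car_id": [1, 2], "name": ["Sedan", "Coupe"]})
part = pd.DataFrame({"part_id": [10, 20],
                     "name": ["engine", "wheel"],
                     "description": ["V6 engine", "alloy wheel"]})
car_part_bridge = pd.DataFrame({"car_id": [1, 1, 2], "part_id": [10, 20, 20]})

# The foreign keys in the bridge table join car to part
result = (car_part_bridge
          .merge(car, on="car_id")
          .merge(part, on="part_id", suffixes=("_car", "_part")))
print(result[["name_car", "name_part", "description"]])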

Finally, the text aligned on the right side next to each field name, labeled int or text, is the data type for each field. We will cover that concept next, and you should now feel comfortable with the concepts for identifying and classifying data.

Common data types

Data types are a well-known concept in programming languages and are found in many different technologies. I have simplified the definition to: the details of the data that is stored and its intended usage. A data type also creates consistency for each data value as it is stored on disk or in memory.

Data types will vary depending on the software and/or database used to create the structure. Hence, we won't be covering all the different types across all of the different coding languages but let's walk through a few examples:

Common data type | Common short name | Sample value | Example usage
Integers | int | 1235 | Counting occurrences, summing values, or averaging values such as sum(hits)
Booleans | bit | TRUE | Conditional testing such as if sales > 1,000, true else false
Geospatial | float or spatial | 40.229290, -74.936707 | Geo analytics based on latitude and longitude
Characters/strings | char | A | Tagging, binning, or grouping data
Floating-point numbers | float or double | 2.1234 | Sales, cost analysis, or stock price
Alphanumeric strings | blob or varchar | United States | Tagging, binning, encoding, or grouping data
Time | time, timestamp, date | 8/19/2000 | Time-series analysis or year-over-year comparison

Technologies change, and legacy systems will offer opportunities to see data types that may not be common. The best advice when dealing with new data types is to validate how the source systems created the data by speaking to an SME (Subject Matter Expert) or system administrator, or to ask for documentation that includes the active version used to persist the data.

In the preceding table, I've created a summary of some common data types. Getting comfortable understanding the differences between data types is important because they determine what type of analysis can be performed on each data value. Numeric data types such as integer (int), floating-point numbers (float), or double are used for mathematical calculations of values, such as the sum of sales, a count of apples, or the average price of a stock. Ideally, the source system of record should enforce the data type, but there can be, and usually are, exceptions.

As you evolve your data analysis skills, helping to resolve data type issues or offering suggestions to improve them will improve the quality and accuracy of reporting throughout the organization.

String data types, which are defined in the preceding table as characters (char) and alphanumeric strings (varchar or blob), can be represented as text such as a word or a full sentence. Time is a special data type that can be represented and stored in multiple ways, such as 12 PM EST, or as a date such as 08/19/2000. Consider geospatial coordinates such as latitude and longitude, which can be stored in multiple data types depending on the source system.

The goal of this chapter is to introduce you to the concept of data types, and future chapters will give you direct, hands-on experience of working with them. Data types are important because they help you avoid incomplete or inaccurate information when presenting facts and insights from analysis. Invalid or inconsistent data types also restrict the ability to create accurate charts or data visualizations. Finally, good data analysis is about having confidence and trust that your conclusions are complete, with defined data types that support your analysis.
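As a minimal pandas sketch of why data types matter (the frame and its values are my own, loosely echoing the table above), notice how a numeric column stored as strings silently breaks aggregation:

import pandas as pd

# A small frame echoing the common data types above; the values are made up
df = pd.DataFrame({
    "hits": [1235, 872],                     # integer
    "price": [2.1234, 3.5],                  # floating-point number
    "country": ["United States", "Canada"],  # alphanumeric string
    "sales": ["1000", "2500"],               # numbers stored as strings
})
print(df.dtypes)

# Summing the string column concatenates instead of adding...
print(df["sales"].sum())                 # '10002500'

# ...so the data type must be corrected before numeric analysis
print(pd.to_numeric(df["sales"]).sum())  # 3500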

Data classifications and data attributes explained

Now that we understand more about data types and why they are important, let's break down the different classifications of data and understand the different data attribute types. To begin with a visual, let's summarize all of the possible combinations in the following diagram:

In the preceding diagram, the boxes directly below data show the three methods to classify data, which are continuous, categorical, or discrete.

Continuous data is measurable, quantified with a numeric data type, and has a continuous range with infinite possibilities. The bottom boxes in this diagram are examples so you can easily find them for reference. Continuous data examples include a stock price, weight in pounds, and time.

Categorical (descriptive) data will have values as a string data type. Categorical data is qualified, so it would describe something specific such as a person, place, or thing. Some examples include a country of origin, a month of the year, the different types of trees, and your family designation.

A discrete data type can be either continuous or categorical depending on how it's used for analysis. Examples include the number of employees in a company. You must have an integer/whole number representing the count of employees, because you can never have partial results such as half an employee. Discrete data is continuous in nature because of its numeric properties but also has limits that make it similar to categorical data. Another example would be the numbers on a roulette wheel. The wheel is limited to the whole numbers 1 to 36, plus 0 and 00, that a player can bet on, and the numbers can also be categorized as red, black, or green depending on the value.

If only two discrete values exist, such as yes/no, true/false, or 1/0, the data can also be classified as binary.

Data attributes

Now that we understand how to classify data, let's break down the attribute types available to better understand how you can use them for analysis. The easiest method is to start with how you plan to use the data values:

  • Nominal data is defined as data where you can distinguish between different values but not necessarily order them. It is qualitative in nature, so think of nominal data as labels or names, such as stocks or bonds, where math cannot be performed because they are string values. With nominal values, you cannot determine whether the word stocks or bonds is better or worse without additional information.
  • Ordinal data is ordered data where a ranking exists, but the distance or range between values cannot be defined. Ordinal data is qualitative, using labels or names, but now the values have a natural or defined sequence. Similar to nominal data, ordinal data can be counted but not calculated with all statistical methods.

An example is assigning 1 = low, 2 = medium, and 3 = high. This has a natural sequence, but the difference between low and high cannot be quantified by itself. The data assigned to the low and high values could be arbitrary or have additional business rules behind it.

Another common example of ordinal data is natural hierarchies, such as state, county, and city, or grandfather, father, and son. The relationship between these values is well defined and commonly understood without any additional information to support it; within the hierarchy, a son will have a father, but a father cannot be a son.

  • Interval data is like ordinal data, but the distance between data points is uniform. Weight on a scale in pounds is a good example because the differences between the values from 5 to 10, 10 to 15, and 20 to 25 are all the same. Note that not every arithmetic operation can be performed on interval data, so understanding the context of the data and how it should be used becomes important.

Temperature is a good example to demonstrate this paradigm. You can record hourly values and even provide a daily average, but summing the values per day or week would not provide accurate information for analysis. See the following diagram, which provides hourly temperatures for a specific day. Notice the x axis breaks out the hours and the y axis provides the average, which is labeled Avg Temperature, in Fahrenheit. The values between each hour must be an average or mean, because an accumulation of temperature would provide misleading results and inaccurate analysis:

  • Ratio data allows for all arithmetic operations, including sum, average, median, mode, multiplication, and division. The data types of integer and float discussed earlier are classified as ratio data attributes, which in turn are also numeric/quantitative. Also, time could be classified as ratio data; however, I decided to further break down this attribute because of how often it is used for data analysis.
Note that there are advanced statistical details about ratio data attributes that are not covered in this book, such as having an absolute or true zero, so I encourage you to learn more about the subject.
  • Time data attributes are a rich subject that you will come across regularly during your data analysis journey. Time data covers both date and time or any combination, for example, the time as HH:MM AM/PM, such as 12:03 AM; the year as YYYY, such as 1980; a timestamp represented as YYYY-MM-DD hh:mm:ss, such as 2000-08-19 14:32:22; or even a date as MM/DD/YY, such as 08/19/00. What's important when dealing with time data is to identify the intervals between each value so you can accurately measure the difference between them.
It is common during many data analysis projects to find gaps in a sequence of time data values. For example, you are given a dataset with a range from 08/01/2019 to 08/31/2019, but only 25 distinct date values exist versus the 31 days in the range. There are various reasons for this occurrence, including system outages where log data was lost. How to handle those data gaps will vary depending on the type of analysis you have to perform, including the need to fill in missing results. We will cover some examples in Chapter 7, Exploring, Cleaning, Refining, and Blending Datasets.
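As a minimal pandas sketch of finding such gaps (the observed dates are invented for illustration), you can build the expected calendar and diff it against the dates you actually have:

import pandas as pd

# Hypothetical dates observed in a log, with gaps in the August 2019 range
observed = pd.to_datetime(["2019-08-01", "2019-08-02", "2019-08-05",
                           "2019-08-06", "2019-08-09"])

# Build the full expected daily sequence and compare
expected = pd.date_range(start="2019-08-01", end="2019-08-31", freq="D")
missing = expected.difference(observed)
print(f"{len(missing)} missing days")  # 26 missing days
print(missing[:3])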

Understanding data literacy

Data literacy is defined by Rahul Bhargava and Catherine D'Ignazio as the ability to read, work with, analyze, and argue with data. Throughout this chapter, I have pointed out how data comes in all shapes and sizes, so creating a common framework to communicate about data between different audiences becomes an important skill to master.

Data literacy becomes a common denominator for answering data questions between two or more people with different skills or experience. For example, if a sales manager wants to verify the data behind a chart in a quarterly report, fluency in the language of data will save time: they can ask the engineering team direct questions about the data types and data attributes instead of searching for those details aimlessly.

Let's break down the concepts of data literacy to help identify how they can be applied to your personal and professional life.

Reading data

What does it mean to read data? Reading data is consuming information, and that information can be in any format including a chart, a table, code, or the body of an email.

Reading data may not necessarily provide the consumer with all of the answers to their questions. Having domain expertise may be required to understand how, when, and why a dataset was created to allow the consumer to fully interpret the underlying dataset.

For example, suppose you are a data analyst and a colleague sends a file attachment to your email with the subject line FYI and no additional information in the body of the message. We now know from the What makes a good data analyst? section that we should start asking questions about the file attachment:

  • What methods were used to create the file (human or machine)?
  • What system(s) and workflow were used to create the file?
  • Who created the file and when was it created?
  • How often does this file refresh and is it manual or automated?

Asking these questions helps you to understand the concept of data lineage, which can identify the process of how a dataset was created. This ensures that reading the data results in understanding all of its aspects, so you can make decisions from it confidently.

Working with data

I define working with data as a person or system creating a dataset using any technology. The technologies used to create data are vastly varied and could range from someone typing rows and columns into a spreadsheet to a software developer using loops and functions in Python code to create a pipe-delimited file.

Since writing data varies by expertise and job function, a key takeaway from a data literacy perspective is that the producer of data should be conscious of how it will be consumed. Ideally, the producer should document the details of how, when, and where the data was created, including how often it is refreshed. Publishing this information democratizes the metadata (data about the data), improving communication between anyone reading and working with the data.

For example, if you have a timestamp field in your dataset, is it using UTC (Coordinated Universal Time) or EST (Eastern Standard Time)? By including assumptions and reasons why the data is stored in a specific format, the person or team working with the data becomes a good data citizen by improving the communication for analysis.

Analyzing data

Analyzing data begins with modeling and structuring it to answer business questions. Data modeling is a vast topic, but for data literacy purposes, it can be boiled down to dimensions and measures. Dimensions are distinct nouns such as a person, place, or thing, and measures are verbs based on actions and then aggregated (sum, count, min, max, and average).
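Here is a minimal pandas sketch of that split (the rows are made up): the dimension becomes the grouping key, and the measure is what gets aggregated.

import pandas as pd

# Made-up rows: Product is a dimension, Sales is a measure
df = pd.DataFrame({
    "Product": ["Product 1", "Product 1", "Product 2"],
    "Sales": [1000.0, 2000.0, 1000.0],
})

# Group by the dimension, then aggregate the measure
print(df.groupby("Product")["Sales"].agg(["sum", "count", "mean"]))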

The foundation for building any data visualization and charts is rooted in data modeling, and most modern tech solutions have it built in, so you may already be modeling data without realizing it.

One quick solution to help to classify how the data should be used for analysis would be a data dictionary, which is defined as a centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format.

You might be able to find a data dictionary in the help pages of source systems or from GitHub repositories. If you don't receive one from the creator of the file, you can create one for yourself and use it to ask questions about the data including assumed data types, data quality, and identifying data gaps.

Creating a data dictionary also helps to validate assumptions and is an aid to frame questions about the data when communicating with others. The easiest method to create a data dictionary would be to transpose the first few rows of the source data so the rows turn into columns. If your data has a header row, then the first row turns into a list of all of the fields available. Let's walk through an example of how to create your own data dictionary from data. Here, we have a source Sales table representing Product and Customer sales by quarter:

Product   | Customer   | Quarter 1  | Quarter 2  | Quarter 3  | Quarter 4
Product 1 | Customer A | $ 1,000.00 | $ 2,000.00 | $ 6,000.00 |
Product 1 | Customer B |            | $ 1,000.00 | $ 500.00   |
Product 2 | Customer A |            | $ 1,000.00 |            |
Product 2 | Customer C | $ 2,000.00 | $ 2,500.00 | $ 5,000.00 |
Product 3 | Customer A | $ 1,000.00 | $ 2,000.00 |            |
Product 4 | Customer B | $ 1,000.00 | $ 3,000.00 |            |
Product 5 | Customer A |            |            |            | $ 1,000.00
In the following table, I have transposed the preceding source table to create a new table for analysis, which forms an initial data dictionary. The first column on the left becomes a list of all of the fields available from the source table. As you can see, the fields Record 1 to Record 3 in the header row now become sample rows of data but retain the integrity of each row from the source table. The last two columns on the right, labeled Estimated Data Type and Dimension or Measure, were added to help to define the use of this data for analysis. Understanding the data type and classifying each field as a dimension or measure will help to determine what type of analysis we can perform and how each field can be used in data visualizations:

Field Name | Record 1   | Record 2   | Record 3   | Estimated Data Type | Dimension or Measure
Product    | Product 1  | Product 1  | Product 2  | varchar             | Dimension
Customer   | Customer A | Customer B | Customer A | varchar             | Dimension
Quarter 1  | $ 1,000.00 |            |            | float               | Measure
Quarter 2  | $ 2,000.00 | $ 1,000.00 | $ 1,000.00 | float               | Measure
Quarter 3  | $ 6,000.00 | $ 500.00   |            | float               | Measure
Quarter 4  |            |            |            | float               | Measure
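If you want to produce this skeleton programmatically, here is a minimal pandas sketch assuming the source table has been loaded into a DataFrame; the Estimated Data Type and Dimension or Measure columns would still be filled in by hand.

import pandas as pd

# The first three rows of the source Sales table from above
sales = pd.DataFrame({
    "Product":   ["Product 1", "Product 1", "Product 2"],
    "Customer":  ["Customer A", "Customer B", "Customer A"],
    "Quarter 1": [1000.00, None, None],
    "Quarter 2": [2000.00, 1000.00, 1000.00],
    "Quarter 3": [6000.00, 500.00, None],
    "Quarter 4": [None, None, None],
})

# Transpose so fields become rows: the skeleton of a data dictionary
data_dictionary = sales.head(3).T
data_dictionary.columns = ["Record 1", "Record 2", "Record 3"]
print(data_dictionary)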

Using this technique can help you to ask the following questions about the data to ensure you understand the results:

  • What year does this dataset represent or is it an accumulation of multiple years?
  • Does each quarter represent a calendar year or fiscal year?
  • Was Product 5 first introduced in Quarter 4, because there are no prior sales for that product by any customer in Quarter 1 to Quarter 3?

Arguing about the data

Finally, let's talk about how and why we should argue about data. Challenging and defending the numbers in charts or data tables helps to build credibility and is actually done in many cases behind the scenes. For example, most data engineering teams put in various checks and balances such as alerts during ingestion to avoid missing information. Additional checks would also include rules to look into log files for anomalies or errors in the processing of data.

From a consumer's perspective, trust and verify is a good approach. For example, when looking at a chart published in a credible news article, you can assume the data behind the story is accurate, but you should also verify the accuracy of the source data. The first thing to ask would be: does the chart include a source to the dataset that is publicly available? The website fivethirtyeight.com is really good at providing access to the raw data and the details of the methodologies used to create the analysis and charts found in news stories. Exposing the underlying dataset and the process used to collect it opens up conversations about the how, what, and why behind the data and is a good method to disprove misinformation.

As a data analyst and creator of data outputs, you should welcome the chance to defend your work. Having documentation such as a data dictionary and a GitHub repository, and documenting the methodology used to produce the data, will build trust with your audience and reduce the time they need to make data-driven decisions.

Hopefully, you now see the importance of data literacy and how it can be used to improve all aspects of communication of data between consumers and producers. With any language, practice will lead to improvement, so I invite you to explore some useful free datasets to improve your data literacy.

Here are a few to get started:

Let's begin with the Kaggle site, which was created to help companies host data science competitions to solve complex problems using data. Improve your reading and working with data literacy skills by exploring these datasets and walking through the concepts learned in this chapter, such as identifying the data type for each field and confirming that a data dictionary exists.

Next is the supporting data from FiveThirtyEight, which is a data journalism site providing analytic content from sports to politics. What I like about their process is the transparency behind the published news stories: they expose open GitHub links to their source data and discussions about the methodology behind it.

Another important open source of data would be The World Bank, which offers a plethora of options to consume or produce data across the world to help improve life through data. Most of the datasets are licensed under a Creative Commons license, which governs the terms of how and when the data can be used, but making them freely available opens up opportunities to blend public and private data together with significant time savings.

Summary

Let's look back at what we learned in this chapter and the skills obtained before we move forward. First, we covered a brief history of data analysis and the technological evolution of data by paying homage to the people and milestone events that made working with data possible using modern tools and techniques. We walked through an example of how to summarize these events using a data visual trend chart that showed how recent technology innovations have transformed the data industry.

We focused on why data has become important for making decisions from both a consumer and a producer perspective by discussing the concepts for identifying and classifying data using structured, semi-structured, and unstructured examples and the 3Vs of big data: Volume, Velocity, and Variety.

We answered the question of what makes a good data analyst using the techniques of KYD, VOC, and ABA.

Then, we went deeper into understanding data types by walking through the differences between numbers (integer and float) and strings (text, time, dates, and coordinates). This included breaking down data classifications (continuous, categorical, and discrete) and understanding data attribute types.

We wrapped up this chapter by introducing the concept of data literacy and its importance to the consumers and producers of data by improving communication between them.

In our next chapter, we will get more hands-on by installing and setting up an environment for data analysis, beginning the journey of applying the concepts we have learned about data.

Further reading

Here are some links that you can refer to for gathering more information about the following topics:
