Data Analysis with Open Source Tools

A Hands-On Guide for Programmers and Data Scientists

Specifications
Paperback, 509 pp. | English
O'Reilly | 1st edition, 2011
ISBN13: 9780596802356
Classification
Main category: Computers and information technology
Expected delivery time: approximately 16 working days

Summary

Collecting data is relatively easy, but turning raw information into something useful requires that you know how to extract precisely what you need. With this insightful book, intermediate to experienced programmers interested in data analysis will learn techniques for working with data in a business environment. You'll learn how to look at data to discover what it contains, how to capture those ideas in conceptual models, and then feed your understanding back into the organization through business plans, metrics dashboards, and other applications.

Along the way, you'll experiment with concepts through hands-on workshops at the end of each chapter. Above all, you'll learn how to think about the results you want to achieve -- rather than rely on tools to think for you.

- Use graphics to describe data with one, two, or dozens of variables
- Develop conceptual models using back-of-the-envelope calculations, as well as scaling and probability arguments
- Mine data with computationally intensive methods such as simulation and clustering
- Make your conclusions understandable through reports, dashboards, and other metrics programs
- Understand financial calculations, including the time-value of money
- Use dimensionality reduction techniques or predictive analytics to conquer challenging data analysis situations
- Become familiar with different open source programming environments for data analysis

"Finally, a concise reference for understanding how to conquer piles of data." -Austin King, Senior Web Developer, Mozilla
"An indispensable text for aspiring data scientists." -Michael E. Driscoll, CEO/Founder, Dataspora

About Philipp Janert

Philipp K. Janert provides consulting services for data analysis and mathematical modeling, drawing on his previous careers as a physicist and software engineer. He is the author of Gnuplot in Action: Understanding Data with Graphs (Manning Publications) and has written for the O'Reilly Network, IBM developerWorks, and IEEE Software. He holds a Ph.D. in theoretical physics from the University of Washington.

Table of Contents

Preface

1. Introduction
-Data Analysis
-What's in This Book
-What's with the Workshops?
-What's with the Math?
-What You'll Need
-What's Missing

Part 1: Graphics: Looking at Data
2. A Single Variable: Shape and Distribution
-Dot and Jitter Plots
-Histograms and Kernel Density Estimates
-The Cumulative Distribution Function
-Rank-Order Plots and Lift Charts
-Only When Appropriate: Summary Statistics and Box Plots
-Workshop: NumPy
-Further Reading

3. Two Variables: Establishing Relationships
-Scatter Plots
-Conquering Noise: Smoothing
-Logarithmic Plots
-Banking
-Linear Regression and All That
-Showing What's Important
-Graphical Analysis and Presentation Graphics
-Workshop: matplotlib
-Further Reading

4. Time As a Variable: Time-Series Analysis
-Examples
-The Task
-Smoothing
-Don't Overlook the Obvious!
-The Correlation Function
-Optional: Filters and Convolutions
-Workshop: scipy.signal
-Further Reading

5. More Than Two Variables: Graphical Multivariate Analysis
-False-Color Plots
-A Lot at a Glance: Multiplots
-Composition Problems
-Novel Plot Types
-Interactive Explorations
-Workshop: Tools for Multivariate Graphics
-Further Reading

6. Intermezzo: A Data Analysis Session
-A Data Analysis Session
-Workshop: gnuplot
-Further Reading

Part 2: Analytics: Modeling Data
7. Guesstimation and the Back of the Envelope
-Principles of Guesstimation
-How Good Are Those Numbers?
-Optional: A Closer Look at Perturbation Theory and Error Propagation
-Workshop: The Gnu Scientific Library (GSL)
-Further Reading

8. Models from Scaling Arguments
-Models
-Arguments from Scale
-Mean-Field Approximations
-Common Time-Evolution Scenarios
-Case Study: How Many Servers Are Best?
-Why Modeling?
-Workshop: Sage
-Further Reading

9. Arguments from Probability Models
-The Binomial Distribution and Bernoulli Trials
-The Gaussian Distribution and the Central Limit Theorem
-Power-Law Distributions and Non-Normal Statistics
-Other Distributions
-Optional: Case Study - Unique Visitors over Time
-Workshop: Power-Law Distributions
-Further Reading

10. What You Really Need to Know About Classical Statistics
-Genesis
-Statistics Defined
-Statistics Explained
-Controlled Experiments Versus Observational Studies
-Optional: Bayesian Statistics - The Other Point of View
-Workshop: R
-Further Reading

11. Intermezzo: Mythbusting - Bigfoot, Least Squares, and All That
-How to Average Averages
-The Standard Deviation
-Least Squares
-Further Reading

Part 3: Computation: Mining Data
12. Simulations
-A Warm-Up Question
-Monte Carlo Simulations
-Resampling Methods
-Workshop: Discrete Event Simulations with SimPy
-Further Reading

13. Finding Clusters
-What Constitutes a Cluster?
-Distance and Similarity Measures
-Clustering Methods
-Pre- and Postprocessing
-Other Thoughts
-A Special Case: Market Basket Analysis
-A Word of Warning
-Workshop: Pycluster and the C Clustering Library
-Further Reading

14. Seeing the Forest for the Trees: Finding Important Attributes
-Principal Component Analysis
-Visual Techniques
-Kohonen Maps
-Workshop: PCA with R
-Further Reading

15. Intermezzo: When More Is Different
-A Horror Story
-Some Suggestions
-What About Map/Reduce?
-Workshop: Generating Permutations
-Further Reading

Part 4: Applications: Using Data
16. Reporting, Business Intelligence, and Dashboards
-Business Intelligence
-Corporate Metrics and Dashboards
-Data Quality Issues
-Workshop: Berkeley DB and SQLite
-Further Reading

17. Financial Calculations and Modeling
-The Time Value of Money
-Uncertainty in Planning and Opportunity Costs
-Cost Concepts and Depreciation
-Should You Care?
-Is This All That Matters?
-Workshop: The Newsvendor Problem
-Further Reading

18. Predictive Analytics
-Topics in Predictive Analytics
-Some Classification Terminology
-Algorithms for Classification
-The Process
-The Secret Sauce
-The Nature of Statistical Learning
-Workshop: Two Do-It-Yourself Classifiers
-Further Reading

19. Epilogue: Facts Are Not Reality

Appendix A: Programming Environments for Scientific Computation and Data Analysis
Appendix B: Results from Calculus
Appendix C: Working with Data
Appendix D: About the Author

Index
