Data Analysis With Open Source Tools

Data Analysis with Open Source Tools
by Philipp K. Janert

Collecting data is relatively easy, but turning raw information into something useful requires that you know how to extract precisely what you need. With this insightful book, intermediate to experienced programmers interested in data analysis will learn techniques for working with data in a business environment. You’ll learn how to look at data to discover what it contains, how to capture those ideas in conceptual models, and then feed your understanding back into the organization through business plans, metrics dashboards, and other applications.

Along the way, you’ll experiment with concepts through hands-on workshops at the end of each chapter. Above all, you’ll learn how to think about the results you want to achieve — rather than rely on tools to think for you.

  • Use graphics to describe data with one, two, or dozens of variables
  • Develop conceptual models using back-of-the-envelope calculations, as well as scaling and probability arguments
  • Mine data with computationally intensive methods such as simulation and clustering
  • Make your conclusions understandable through reports, dashboards, and other metrics programs
  • Understand financial calculations, including the time-value of money
  • Use dimensionality reduction techniques or predictive analytics to conquer challenging data analysis situations
  • Become familiar with different open source programming environments for data analysis

“Finally, a concise reference for understanding how to conquer piles of data.” –Austin King, Senior Web Developer, Mozilla

“An indispensable text for aspiring data scientists.” –Michael E. Driscoll, CEO/Founder, Dataspora


Bioinformatics Data Skills
by Vince Buffalo

Learn the data skills necessary for turning large sequencing datasets into reproducible and robust biological findings. With this practical guide, you’ll learn how to use freely available open source tools to extract meaning from large complex biological data sets.

At no other point in human history has our ability to understand life’s complexities been so dependent on our skills to work with and analyze data. This intermediate-level book teaches the general computational and data skills you need to analyze biological data. If you have experience with a scripting language like Python, you’re ready to get started.

  • Go from handling small problems with messy scripts to tackling large problems with clever methods and tools
  • Process bioinformatics data with powerful Unix pipelines and data tools
  • Learn how to use exploratory data analysis techniques in the R language
  • Use efficient methods to work with genomic range data and range operations
  • Work with common genomics data file formats like FASTA, FASTQ, SAM, and BAM
  • Manage your bioinformatics project with the Git version control system
  • Tackle tedious data processing tasks with Bash scripts and Makefiles

Open Source Geospatial Tools
by Daniel McInerney, Pieter Kempeneers

This book focuses on the use of open source software for geospatial analysis. It demonstrates the effectiveness of the command line interface for handling vector, raster and 3D geospatial data. Appropriate open-source tools for data processing are clearly explained, and the book discusses how they can be used to solve everyday tasks.

A series of fully worked case studies is presented, including vector spatial analysis, remote sensing data analysis, landcover classification and LiDAR processing. A hands-on introduction to the application programming interface (API) of GDAL/OGR in Python/C++ is provided for readers who want to extend existing tools or develop their own software.


Data Analytics Using Open-Source Tools
by Jeffrey Strickland

This book is about data analytics. In that respect, it is like others. What distinguishes it from the rest is the variety of open-source tool applications. This book incorporates the use of RStudio, Python, SAS Studio (University Edition), and KNIME. This book is also about manipulating Big Data. Apache Hadoop on Hortonworks Sandbox is introduced, and we manage, move, handle, and transform data using Apache Hive, Apache Spark, MapReduce and Tez, with terminal shell commands and Ambari. We show you how to set up a virtual machine in Microsoft Azure. We then use the data in later chapters for modeling. We cover descriptive and predictive modeling. The content includes Support Vector Machines, Decision Tree learning, Random Forests, Naive and Empirical Bayes, Gradient Boosting, Cluster Modeling, Generalized Linear Models, Logistic Regression, and Artificial Neural Networks. Every chapter includes completely worked examples using one or more open-source tools.

Practical Data Analysis
by Hector Cuesta, Dr. Sampath Kumar

A practical guide to obtaining, transforming, exploring, and analyzing data using Python, MongoDB, and Apache Spark

About This Book

  • Learn to use various data analysis tools and algorithms to classify, cluster, visualize, simulate, and forecast your data
  • Apply Machine Learning algorithms to different kinds of data such as social networks, time series, and images
  • A hands-on guide to understanding the nature of data and how to turn it into insight

Who This Book Is For

This book is for developers who want to implement data analysis and data-driven algorithms in a practical way. It is also suitable for those without a background in data analysis or data processing. Basic knowledge of Python programming, statistics, and linear algebra is assumed.

What You Will Learn

  • Acquire, format, and visualize your data
  • Build an image-similarity search engine
  • Generate meaningful visualizations anyone can understand
  • Get started with analyzing social network graphs
  • Find out how to implement sentiment text analysis
  • Install data analysis tools such as Pandas, MongoDB, and Apache Spark
  • Get to grips with Apache Spark
  • Implement machine learning algorithms such as classification or forecasting

In Detail

Beyond buzzwords like Big Data or Data Science, there are great opportunities to innovate in many businesses by using data analysis to build data-driven products. Data analysis involves asking many questions about data in order to discover insights and generate value for a product or a service.

This book explains the basic data algorithms without the theoretical jargon, and you’ll get hands-on experience turning data into insights using machine learning techniques. We will perform data-driven innovation processing for several types of data such as text, images, social network graphs, documents, and time series, showing you how to implement large data processing with MongoDB and Apache Spark.

Style and approach

This is a hands-on guide to data analysis and data processing. The concrete examples are explained with simple code and accessible data.


Python for Data Analysis
by Wes McKinney

Get complete instructions for manipulating, processing, cleaning, and crunching datasets in Python. Updated for Python 3.6, the second edition of this hands-on guide is packed with practical case studies that show you how to solve a broad set of data analysis problems effectively. You’ll learn the latest versions of pandas, NumPy, IPython, and Jupyter in the process.

Written by Wes McKinney, the creator of the Python pandas project, this book is a practical, modern introduction to data science tools in Python. It’s ideal for analysts new to Python and for Python programmers new to data science and scientific computing. Data files and related material are available on GitHub.

  • Use the IPython shell and Jupyter notebook for exploratory computing
  • Learn basic and advanced features in NumPy (Numerical Python)
  • Get started with data analysis tools in the pandas library
  • Use flexible tools to load, clean, transform, merge, and reshape data
  • Create informative visualizations with matplotlib
  • Apply the pandas groupby facility to slice, dice, and summarize datasets
  • Analyze and manipulate regular and irregular time series data
  • Learn how to solve real-world data analysis problems with thorough, detailed examples

Open Source Software in Life Science Research
by Lee Harland, Mark Forster

The free/open source approach has grown from a minor activity to become a significant producer of robust, task-orientated software for a wide variety of situations and applications. To life science informatics groups, these systems present an appealing proposition – high quality software at a very attractive price. Open Source Software in Life Science Research considers how industry and applied research groups have embraced these resources, discussing practical implementations that address real-world business problems.

The book is divided into four parts. Part one looks at laboratory data management and chemical informatics, covering software such as Bioclipse, OpenTox, ImageJ and KNIME. In part two, the focus turns to genomics and bioinformatics tools, with chapters examining GenomicsTools and EBI Atlas software, as well as the practicalities of setting up an ‘omics’ platform and managing large volumes of data. Chapters in part three examine information and knowledge management, covering a range of topics including software for web-based collaboration, open source search and visualisation technologies for scientific business applications, and specific software such as DesignTracker and Utopia Documents. Part four looks at semantic technologies such as Semantic MediaWiki, TripleMap and Chem2Bio2RDF, before part five examines clinical analytics, and validation and regulatory compliance of free/open source software. Finally, the book concludes by looking at future perspectives and the economics of free/open source software in industry.

  • Discusses a broad range of applications from a variety of sectors
  • Provides a unique perspective on work normally performed behind closed doors
  • Highlights the criteria used to compare and assess different approaches to solving problems

Data Simplification
by Jules J. Berman

Data Simplification: Taming Information With Open Source Tools addresses the simple fact that modern data is too big and complex to analyze in its native form. Data simplification is the process whereby large and complex data is rendered usable. Complex data must be simplified before it can be analyzed, but the process of data simplification is anything but simple, requiring a specialized set of skills and tools.

This book provides data scientists from every scientific discipline with the methods and tools to simplify their data for immediate analysis or long-term storage in a form that can be readily repurposed or integrated with other data.

Drawing upon years of practical experience, and using numerous examples and use cases, Jules Berman discusses the principles, methods, and tools that must be studied and mastered to achieve data simplification; open source tools, free utilities, and snippets of code that can be reused and repurposed to simplify data; natural language processing and machine translation as tools to simplify data; and data summarization and visualization and the role they play in making data useful for the end user.

  • Discusses data simplification principles, methods, and tools that must be studied and mastered
  • Provides open source tools, free utilities, and snippets of code that can be reused and repurposed to simplify data
  • Explains how to best utilize indexes to search, retrieve, and analyze textual data
  • Shows the data scientist how to apply ontologies, classifications, classes, properties, and instances to data using tried and true methods

Text Mining and Visualization
by Markus Hofmann, Andrew Chisholm

Text Mining and Visualization: Case Studies Using Open-Source Tools provides an introduction to text mining using some of the most popular and powerful open-source tools: KNIME, RapidMiner, Weka, R, and Python.

The contributors—all highly experienced with text mining and open-source software—explain how text data are gathered and processed from a wide variety of sources, including books, server access logs, websites, social media sites, and message boards. Each chapter presents a case study that you can follow as part of a step-by-step, reproducible example. You can also easily apply and extend the techniques to other problems. All the examples are available on a supplementary website.

The book shows you how to exploit your text data, offering successful application examples and blueprints for you to tackle your text mining tasks and benefit from open and freely available tools. It gets you up to date on the latest and most powerful tools, the data mining process, and specific text mining activities.


