fix(guide): simplify directory structure

This commit is contained in:
Mrugesh Mohapatra
2018-10-16 21:26:13 +05:30
parent f989c28c52
commit da0df12ab7
35752 changed files with 0 additions and 317652 deletions

View File

@@ -0,0 +1,18 @@
---
title: Tableau
---
# Tableau
Tableau is data visualization software for business intelligence. It is commercially available through Tableau Software. Capabilities include highly interactive visualizations, mapping, and dashboards. A robust community for Tableau visualizations exists online, where many individuals publicly publish their dashboards.
# Getting Started with Tableau Public
If you want to learn Tableau on your own, it's possible to get a free license for the public version of Tableau. Not only can this allow you to import your own datasets into Tableau and create visualizations, but you can also publish these dashboards online and add to your public portfolio.
- Link: https://public.tableau.com/en-us/s/
- Start by creating an online account on Tableau's website, using the 'Sign In' button at the top right, and then clicking on 'Create one now for free'
- After you have done that, you can enter your email address and download the application to your local machine.
- If you want to learn from others, you can also download their published visualizations and take a look at specific elements.
### Links:
* [Tableau Website](https://www.tableau.com)
* [Tableau Public Gallery](https://public.tableau.com/en-us/s/gallery)

View File

@@ -0,0 +1,32 @@
---
title: Talend Open Studio
---
## Talend Open Studio
[Talend](https://www.talend.com/) [Open Studio](https://www.talend.com/products/talend-open-studio/) is an Open Source software suite with solutions for [Data Integration](https://www.talend.com/products/data-integration/data-integration-open-studio/), [Big Data](https://www.talend.com/products/big-data/big-data-open-studio/), [Data Preparation](https://www.talend.com/products/data-preparation/data-preparation-free-desktop/), [Data Quality](https://www.talend.com/products/talend-open-studio/data-quality-open-studio/), [Enterprise Service Bus](https://www.talend.com/products/application-integration/esb-open-studio/) and [Master Data Management](https://www.talend.com/products/mdm/mdm-open-studio/).
Thanks to the Open Source nature of this project, a thriving community helps to develop [extensions and components](https://exchange.talend.com/).
## Data Integration
Jumpstart your ETL projects and integrate data
![Data integration](https://www.talend.com/wp-content/uploads/connect-and-transform-data-in-no-time.jpg)
## Big Data
Simplify ETL for large and diverse data sets
![Big data](https://www.talend.com/wp-content/uploads/connect-and-transform-big-data.jpg)
## Data Preparation
Enable users to discover, blend and clean data
![Data preparation](https://www.talend.com/wp-content/uploads/get-your-time-back.jpg)
## Data Quality
Assess the accuracy and integrity of data
![Data quality](https://www.talend.com/wp-content/uploads/map-your-path-to-clean-data.jpg)
## Enterprise Service Bus
Speed up orchestration of applications and APIs
![Enterprise service bus](https://www.talend.com/wp-content/uploads/speed-up-integration-for-reuse-2.png)
## Master Data Management
Generate a single “version of the truth” for data
![Master data management](https://www.talend.com/wp-content/uploads/generate-a-single-version-of-the-truth-for-data.jpg)

View File

@@ -0,0 +1,29 @@
---
title: detail
---
## What is Data Science
Data Science is a multi-disciplinary field that combines skills in software engineering and statistics with domain experience to support the end-to-end analysis of large and diverse data sets, ultimately uncovering value for an organization and then communicating that value to stakeholders as actionable results.
## Data Scientist
A data scientist is often described as a person who is better at statistics than any software engineer and better at software engineering than any statistician.
## What Skills Do You Need?
* Mathematics - Calculus, Linear Algebra
* Statistics - Hypothesis Testing, Regression
* Programming - SQL, R/Python
* Machine Learning - Supervised and Unsupervised Learning, Model Fitting
* Business/Product Intuition - Interpreting and communicating results to a non-technical audience
## Life Cycle
1. Identify or Formulate the Problem
2. Data Preparation
3. Data Exploration
4. Transform and Select
5. Build Model
6. Validate Model
7. Deploy Model
8. Evaluate or Monitor Results

View File

@@ -0,0 +1,98 @@
---
title: Flink Batch Example JAVA
---
## Flink Batch Example JAVA
Apache Flink is an open source stream processing framework with powerful stream- and batch-processing capabilities.
### Prerequisites
* Unix-like environment (Linux, Mac OS X, Cygwin)
* git
* Maven (we recommend version 3.0.4)
* Java 7 or 8
* IntelliJ IDEA or Eclipse IDE
```bash
git clone https://github.com/apache/flink.git
cd flink
mvn clean package -DskipTests # this will take up to 10 minutes
```
### Datasets
For the batch processing example we'll be using the MovieLens dataset available here: [datasets](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip)
In this example we'll be using `movies.csv` and `ratings.csv`. Create a new Java project and put them in a folder (here, `ml-latest-small`) at the application's base directory.
### Example
We're going to build a job that computes the average rating by movie genre across the entire dataset.
**Environment and datasets**
First, create a new Java file; in this example it is named `AverageRating.java`.
The first thing we'll do is create the execution environment and load the CSV files into datasets, like this:
```java
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple3<Long, String, String>> movies = env.readCsvFile("ml-latest-small/movies.csv")
.ignoreFirstLine()
.parseQuotedStrings('"')
.ignoreInvalidLines()
.types(Long.class, String.class, String.class);
DataSet<Tuple2<Long, Double>> ratings = env.readCsvFile("ml-latest-small/ratings.csv")
.ignoreFirstLine()
.includeFields(false, true, true, false)
.types(Long.class, Double.class);
```
Here we build a dataset of `<Long, String, String>` tuples for the movies (id, title, genres), ignoring invalid lines, quoted strings, and the header line, and a dataset of `<Long, Double>` tuples for the ratings (movie id, rating), also skipping the header and keeping only the relevant columns.
**Flink Processing**
Here we will process the dataset with Flink. The result will be a list of `<String, Double>` tuples, where the String holds the genre and the Double holds its average rating.
First we join the ratings dataset with the movies dataset on the movie id present in both.
From each match we create a new tuple with the movie name, genre, and score.
We then group these tuples by genre, sum the scores within each genre, and finally divide the total by the number of ratings to get the desired average.
```java
List<Tuple2<String, Double>> distribution = movies.join(ratings)
.where(0)
.equalTo(0)
.with(new JoinFunction<Tuple3<Long, String, String>,Tuple2<Long, Double>, Tuple3<StringValue, StringValue, DoubleValue>>() {
private StringValue name = new StringValue();
private StringValue genre = new StringValue();
private DoubleValue score = new DoubleValue();
private Tuple3<StringValue, StringValue, DoubleValue> result = new Tuple3<>(name,genre,score);
@Override
public Tuple3<StringValue, StringValue, DoubleValue> join(Tuple3<Long, String, String> movie,Tuple2<Long, Double> rating) throws Exception {
name.setValue(movie.f1);
genre.setValue(movie.f2.split("\\|")[0]);
score.setValue(rating.f1);
return result;
}
})
.groupBy(1)
.reduceGroup(new GroupReduceFunction<Tuple3<StringValue,StringValue,DoubleValue>, Tuple2<String, Double>>() {
@Override
public void reduce(Iterable<Tuple3<StringValue,StringValue,DoubleValue>> iterable, Collector<Tuple2<String, Double>> collector) throws Exception {
StringValue genre = null;
int count = 0;
double totalScore = 0;
for(Tuple3<StringValue,StringValue,DoubleValue> movie: iterable){
genre = movie.f1;
totalScore += movie.f2.getValue();
count++;
}
collector.collect(new Tuple2<>(genre.getValue(), totalScore/count));
}
})
.collect();
```
With this you'll have a working Flink batch-processing application. Enjoy!
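For reference, a minimal class skeleton showing where the two snippets above fit; the imports are assumed from the standard Flink Java DataSet API, and the class name just follows the example:
```java
// Assumed imports for the snippets above (standard Flink Java DataSet API).
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.types.DoubleValue;
import org.apache.flink.types.StringValue;
import org.apache.flink.util.Collector;
import java.util.List;

public class AverageRating {
    public static void main(String[] args) throws Exception {
        // 1. Create the execution environment and read movies.csv / ratings.csv (first snippet).
        // 2. Join, group by genre, and reduce to the average rating (second snippet).
        // collect() triggers execution, so no explicit env.execute() call is needed;
        // the resulting List<Tuple2<String, Double>> can then be printed or stored.
    }
}
```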

View File

@@ -0,0 +1,71 @@
---
title: Flink
---
## Flink
Apache Flink is an open source stream processing framework with powerful stream- and batch-processing capabilities.
The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner.
Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs. Furthermore, Flink's runtime supports the execution of iterative algorithms natively.
Flink provides a high-throughput, low-latency streaming engine as well as support for event-time processing and state management. Flink applications are fault-tolerant in the event of machine failure and support exactly-once semantics.
Programs can be written in Java, Scala, Python, and SQL and are automatically compiled and optimized into dataflow programs that are executed in a cluster or cloud environment.
Flink does not provide its own data storage system and provides data source and sink connectors to systems such as Amazon Kinesis, Apache Kafka, HDFS, Apache Cassandra, and ElasticSearch.
![Flink workflow](https://flink.apache.org/img/flink-home-graphic-update.svg)
**What Is New in Apache Flink?**
* Flink implements true stream processing rather than imitating it with micro-batch processing. In Spark, streaming is a special case of batching, while in Flink, batching is a special case of streaming (a stream of finite size)
* Flink has better support for cyclical and iterative processing
* Flink has lower latency and higher throughput
* Flink has more powerful windows operators
* Flink implements lightweight distributed snapshots that have low overhead and provide exactly-once processing guarantees in stream processing, without resorting to micro-batching as Spark does
* Flink supports mutable state in stream processing
### Features
* A streaming-first runtime that supports both batch processing and data streaming programs
* Elegant and fluent APIs in Java and Scala
* A runtime that supports very high throughput and low event latency at the same time
* Support for *event time* and *out-of-order* processing in the DataStream API, based on the *Dataflow Model*
* Flexible windowing (time, count, sessions, custom triggers) across different time semantics (event time, processing time)
* Fault-tolerance with *exactly-once* processing guarantees
* Natural back-pressure in streaming programs
* Libraries for Graph processing (batch), Machine Learning (batch), and Complex Event Processing (streaming)
* Built-in support for iterative programs (BSP) in the DataSet (batch) API
* Custom memory management for efficient and robust switching between in-memory and out-of-core data processing algorithms
* Compatibility layers for Apache Hadoop MapReduce and Apache Storm
* Integration with YARN, HDFS, HBase, and other components of the Apache Hadoop ecosystem
### Flink Usage
Prerequisites for building Flink:
* Unix-like environment (We use Linux, Mac OS X, Cygwin)
* git
* Maven (we recommend version 3.0.4)
* Java 7 or 8
```bash
git clone https://github.com/apache/flink.git
cd flink
mvn clean package -DskipTests # this will take up to 10 minutes
```
## Developing Flink
The Flink committers use IntelliJ IDEA to develop the Flink codebase.
We recommend IntelliJ IDEA for developing projects that involve Scala code.
Minimal requirements for an IDE are:
* Support for Java and Scala (also mixed projects)
* Support for Maven with Java and Scala
#### More Information:
* Flink website: <a href='https://flink.apache.org/' target='_blank' rel='nofollow'>Apache Flink</a>
* Flink documentation: <a href='https://ci.apache.org/projects/flink/flink-docs-release-1.3/' target='_blank' rel='nofollow'>flinkdocs</a>
* Quick flink tutorial: <a href='https://www.linkedin.com/pulse/introduction-apache-flink-quickstart-tutorial-malini-shukla/' target='_blank' rel='nofollow'>quick start</a>
* How to guide: <a href='https://data-artisans.com/blog/kafka-flink-a-practical-how-to' target='_blank' rel='nofollow'>howto</a>
* Flink vs Spark: <a href='http://www.developintelligence.com/blog/2017/02/comparing-contrasting-apache-flink-vs-spark/' target='_blank' rel='nofollow'>comparison</a>

View File

@@ -0,0 +1,55 @@
---
title: Hadoop
---
## ![Hadoop](http://2s7gjr373w3x22jf92z99mgm5w-wpengine.netdna-ssl.com/wp-content/uploads/2014/08/Hadoop_logo_2.png)
### Did you know?
Hadoop is named after a toy elephant belonging to Doug Cutting's son. Doug chose the name for his open-source project as it was easy to spell, pronounce, and find in search results. The original yellow stuffed elephant that inspired the name appears in Hadoop's logo.
### What is Hadoop?
Hadoop is a framework that allows the distributed processing of large data sets across a cluster of computers, using simple programming models. It enables scaling up from single servers to thousands of machines, each offering its own local computation and storage. Rather than rely on hardware to deliver high-availability, Hadoop itself is designed to detect and handle failures at the application layer. If one machine in a cluster fails, Hadoop can compensate for the failure without losing data. This enables the delivery of a highly-available service on top of a cluster of computers, each of which may be prone to failures.
In 2003 Google released their paper on the Google File System (GFS). It detailed a proprietary distributed file system intended to provide efficient access to large amounts of data using commodity hardware. A year later, Google released another paper, "MapReduce: Simplified Data Processing on Large Clusters." These papers inspired the storage and processing components of Doug Cutting's open-source web search project, Apache Nutch, work he continued at Yahoo. In 2006, the components now known as Hadoop moved out of Apache Nutch and were released as a separate project.
### Why is Hadoop useful?
According to IBM: "Every day, 2.5 billion gigabytes of high-velocity data are created in a variety of forms, such as social media posts, information gathered in sensors and medical devices, videos and transaction records."
Some examples of frequently created data are:
- Metadata from phone usage
- Website logs
- Credit card purchase transactions
"Big data" refers to data sets that are too large or complex to process using traditional software applications. Factors that contribute to the complexity of data are the size of the data set, speed of available processors, and the data's format.
At the time of its release, Hadoop was capable of processing data on a larger scale than traditional software.
### Core Hadoop
Data is stored in the Hadoop Distributed File System (HDFS). Using MapReduce, Hadoop processes data in parallel chunks (processing several parts at the same time) rather than in a single queue. This reduces the time needed to process large data sets.
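To get a feel for the MapReduce model, here is a minimal word-count sketch in the style of Hadoop Streaming, which runs scripts that read lines from stdin and write tab-separated key/value pairs to stdout; the file names are illustrative:
```python
#!/usr/bin/env python
# mapper.py -- emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word.lower(), 1))
```
```python
#!/usr/bin/env python
# reducer.py -- sum the counts per word (Hadoop sorts the mapper output by key first)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```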
HDFS works by storing large files divided into chunks, and replicating them across many servers. Having multiple copies of files creates redundancy, which protects against data loss.
### Hadoop Ecosystem
Many other software packages exist to complement Hadoop. These programs comprise the Hadoop Ecosystem. Some make it easier to load data into the Hadoop cluster, while others make Hadoop easier to use.
The Hadoop Ecosystem includes:
- Apache Hive
- Apache Pig
- Apache HBase
- Apache Phoenix
- Apache Spark
- Apache ZooKeeper
- Cloudera Impala
- Apache Flume
- Apache Sqoop
- Apache Oozie
#### More Information:
1. [Udacity course on hadoop](https://www.udacity.com/course/intro-to-hadoop-and-mapreduce--ud617)
1. [Apache Hadoop](http://hadoop.apache.org/)
1. [Big Data Hadoop Tutorial Videos by edureka!](https://www.youtube.com/playlist?list=PL9ooVrP1hQOFrYxqxb0NJCdCABPZNo0pD)

View File

@@ -0,0 +1,15 @@
---
title: Data Science Tools
---
## Data Science Tools
In this section, we'll have guides to a wide variety of tools used by data scientists.
Data scientists are inquisitive and often seek out new tools that help them find answers. They also need to be proficient in using the tools of the trade, even though there are dozens upon dozens of them. Overall, data scientists should have a working knowledge of statistical programming languages for constructing data processing systems, databases, and visualization tools. Many in the field also deem a knowledge of programming an integral part of data science; however, not all data science students study programming, so it is helpful to be aware of tools that circumvent programming and provide a user-friendly graphical interface, so that a data scientist's knowledge of algorithms is enough to build predictive models.
What is great about data science is that there are numerous pathways to becoming a data scientist. You don't necessarily have to have a degree in computer science or mathematics. With subject matter expertise, such as in biostatistics, geography, or political science, you can acquire the skills to use data science in multiple ways. There are a plethora of online resources, boot camps, and local meetups where you can immerse yourself in the data science community (see resources below).
There are a few tools that you can start learning to get into data science. R remains the leading tool, with a 49% share, but use of the Python language is growing fast and approaching the popularity of R. RapidMiner remains the most popular general Data Science platform, Big Data tools are used by almost 40% of data scientists, and Deep Learning usage has doubled.
Data Science is OSEMN (**O**btain, **S**crub, **E**xplore, **M**odel, i**N**terpret) the Data.
One good resource for Data Science and Machine Learning is the Open Source Data Science Masters; you can follow datasciencemasters on GitHub.
* [Resources for Data Science](https://github.com/datasciencemasters/go)

View File

@@ -0,0 +1,93 @@
---
title: Jupyter Notebook
---
## Jupyter Notebook
Jupyter Notebook is an open-source web application. It allows you to create and share documents that contain live code, equations, visualizations and narrative text.
You can use it for:
* data cleaning and transformation
* numerical simulation
* statistical modeling
* data visualization
* machine learning
<img src="https://github.com/indianmoody/images/blob/master/guide_fcc/guides_jupyter_snap.jpeg" width="400" height="300" />
See your results as you go step by step. Just like in this image.
## What Is A Jupyter Notebook?
In this case, "notebook" or "notebook documents" denote documents that contain both code and rich text elements, such as figures, links, and equations. Because of this mix of code and text elements, these documents are the ideal place to bring together an analysis description and its results, and they can be executed to perform the data analysis in real time.
"Jupyter" is a loose acronym meaning Julia, Python, and R. These programming languages were the first target languages of the Jupyter application, but nowadays, the notebook technology also supports many other languages.
And there you have it: the Jupyter Notebook.
## Installation
You can use Anaconda or Pip to install Jupyter notebook.
For steps to do so, refer to the official guide
<a href='https://jupyter.readthedocs.io/en/latest/install.html'> here.</a>
### Features
* No need to run your complete code file every time. Just run an individual notebook cell to evaluate a specific piece of code.
* The Notebook has support for over 40 programming languages, including Python, R, Julia, and Scala.
* Notebooks can be shared with others using email, Dropbox, GitHub and the Jupyter Notebook Viewer.
* Your code can produce rich, interactive output: HTML, images, videos, LaTeX, and custom MIME types.
* Leverage big data tools, such as Apache Spark, from Python, R and Scala. Explore that same data with pandas, scikit-learn, ggplot2, TensorFlow.
The Jupyter notebook combines two components:
### A web application:
The Jupyter Notebook App helps to edit and run notebook documents in a web browser, combining explanatory text, mathematics, computations and rich media.
### Notebook document:
The Jupyter Notebook App can create a 'Notebook document' containing both code and rich text elements. A Notebook document can be both readable and executable.
These documents are produced by the Jupyter Notebook App.
## Jupyter Notebook App
As a server-client application, the Jupyter Notebook App allows you to edit and run your notebooks via a web browser.
The application can be executed on a PC without Internet access or it can be installed on a remote server, where you can access it through the Internet.
Its two main components are the kernels and a dashboard.
### Kernels
A kernel is a program that runs and introspects the user's code. The Jupyter Notebook App has a kernel for Python code, but there are also kernels available for other programming languages.
### Dashboard
The dashboard of the application not only shows you the notebook documents that you have made and can reopen, but can also be used to manage the kernels: you can check which ones are running and shut them down if necessary.
### How notebooks work
Jupyter notebooks grew out of the IPython project started by Fernando Perez. IPython is an interactive shell, similar to the normal Python shell but with great features like syntax highlighting and code completion. Originally, notebooks worked by sending messages from the web app (the notebook you see in the browser) to an IPython kernel (an IPython application running in the background). The kernel executed the code and sent the results back to the notebook.
![Notebook architecture](https://jupyter.readthedocs.io/en/latest/_images/notebook_components.png)
When you save the notebook, it is written to the server as a JSON file with a **.ipynb** file extension.
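Because the format is plain JSON, a saved notebook can be inspected with nothing more than Python's standard library; a small sketch (the file name is illustrative):
```python
import json

# Open a saved notebook file; every .ipynb has "nbformat" and "cells" at the top level.
with open("example.ipynb") as f:
    nb = json.load(f)

print("nbformat version:", nb["nbformat"])
for cell in nb["cells"]:
    # Each cell records its type (code, markdown, ...) and its source text.
    print(cell["cell_type"], "->", "".join(cell["source"])[:40])
```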
The new name Jupyter comes from the combination of **Ju**lia, **Py**thon, and **R**. There are many kernels for different languages that can be used with Jupyter; you can check the [list of available Jupyter kernels](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels).
### Installing Jupyter Notebook
Jupyter notebooks automatically come with the Anaconda distribution. You'll be able to use notebooks from the default environment.
To install Jupyter notebooks in a conda environment: `conda install jupyter notebook`
To install Jupyter notebooks with pip: `pip install jupyter notebook`
#### More Information:
* [Jupyter Org Website](http://jupyter.org)
* [Jupyter/IPython Notebook Quick Start Guide](http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html)
* [What is Jupyter Notebook by codebasics, duration 8:24](https://www.youtube.com/watch?v=q_BzsPxwLOE)
* [Jupyter Notebook Tutorial / Ipython Notebook Tutorial, by codebasics, duration 24:07](https://www.youtube.com/watch?v=EEEZX_0FMEc)
* [More Information](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)

View File

@@ -0,0 +1,56 @@
---
title: pandas
---
![Everybody loves pandas!](https://pandas.pydata.org/_static/pandas_logo.png "pandas")
## pandas
[pandas](http://pandas.pydata.org/) is a Python library for data analysis using data frames. Data frames are tables of data, which may conceptually be compared to a spreadsheet. Data scientists familiar with R will feel at home here. pandas is often used along with numpy, pyplot, and scikit-learn.
### Importing pandas
It is a widely used convention to import the pandas library using the alias `pd`:
```python
import pandas as pd
```
## Data frames
A data frame consists of a number of rows and columns. Each column represents a feature of the data set, and so has a name and a data type. Each row represents a data point through its associated feature values. The pandas library allows you to manipulate the data in a data frame in various ways. pandas has a lot of possibilities, so the following is merely scratching the surface to give a feel for the library.
## Series
The Series is the basic data type in pandas. A Series is very similar to an array (in fact it is built on top of the NumPy array object). Unlike a plain array, a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It can hold any valid Python object, such as lists, dictionaries, etc.
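A small sketch of creating and indexing a Series (the values and labels are illustrative):
```python
import pandas as pd

# A Series with string labels instead of the default integer index
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s["b"])     # 20 -- access by label
print(s.iloc[0])  # 10 -- access by position
```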
## Loading data from a csv file
A `.csv` file is a *comma separated value* file, a very common way to store data. To load such data into a pandas data frame, use the `read_csv` method:
```python
df = pd.read_csv(file_path)
```
Here, `file_path` can be a local path to a csv file on your computer, or a URL pointing to one. The column names may be included in the csv file, or they may be passed as an argument. For more on this, and much more, take a look at the [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html?highlight=read_csv#pandas.read_csv).
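For example, if the file has no header row, the column names can be supplied directly (the file and column names are illustrative):
```python
df = pd.read_csv("data.csv", header=None, names=["age", "height", "weight"])
```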
## Getting an overview of a data frame
To show the first few rows of a data frame, the `head` method is useful (once more this should sound familiar to R programmers):
```python
df.head()
```
This will show the first 5 rows of the data frame.
To show more than the first 5 rows, simply pass the number of rows you want to print to the `head` method.
```python
df.head(10)
```
This will show the first 10 rows of the data frame.
To show the last few rows of a data frame, the `tail` method is useful (once more this should sound familiar to R programmers):
```python
df.tail()
```
This will show the last 5 rows of the data frame.
## Subsetting: Getting a column by name
A data frame can be subset in many ways. One of the simplest is getting a single column. For instance, if the data frame `df` contains a column named `age`, we can extract it as follows:
```python
ages = df["age"]
```
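Rows can be selected in a similar way with a boolean condition (again assuming the hypothetical `age` column):
```python
adults = df[df["age"] >= 18]
```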
#### More Information:
1. [pandas](http://pandas.pydata.org/)
2. [read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html?highlight=read_csv#pandas.read_csv)
3. [head](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html?highlight=head#pandas.DataFrame.head)

View File

@@ -0,0 +1,20 @@
---
title: Power BI
---
# Power BI:
Power BI is a commercially available product by Microsoft to provide data insights and visualizations for the data curious.
## Core features:
Power BI has many features for data ingestion, analysis, and consumption:
* Data content packs to bring in data from many sources, including GitHub, Facebook, AWS, Azure, and most SQL databases
* Intuitive data modeling and transformation tools that can be automated for tables of similar structure or content
* Stock charts and graphs (including force directed!) for drag-and-drop report and dashboard building
* R console for advanced statistical analysis
* Embedded capabilities for viewing of dashboards across different business groups or for public consumption
## How it is used for Data Science:
While it is not as powerful for complex inference analysis or map-reduce-style processing, it excels at presenting data in interactive dashboards for typical business data consumers. Power BI is the bridge between the data scientist and the rest of the business world, who want data for decision making presented in easy-to-understand visualizations. By leveraging the Office 365 environment that many consumers are familiar with, the data scientist can expose even the most complex custom analysis or visualization through the R console for interaction by whoever needs it.
## Lab
<a href="https://github.com/Microsoft/computerscience/blob/master/Labs/Big%20Data%20and%20Analytics/Power%20BI/Power%20BI.md">Using Microsoft Power BI to Explore and Visualize Data</a>

View File

@@ -0,0 +1,15 @@
---
title: scikit-learn
---
## Scikit-learn
Scikit-learn is a popular open-source machine learning library for Python, built off of previous packages like numpy and scipy. There is code available to handle everything from importing data and data cleaning to model preparation and testing.
## Installation
To install scikit-learn in a conda environment: `conda install scikit-learn` <br>
To install scikit-learn with pip: `pip install scikit-learn`
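As a minimal sketch of the typical workflow (load a dataset, split it, fit a model, score it), assuming scikit-learn is installed; the iris dataset and a k-nearest-neighbors classifier are chosen purely for illustration:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small built-in dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a k-nearest-neighbors classifier and report accuracy on the held-out data
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```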
## References
* [Scikit-learn main page](http://scikit-learn.org/stable/)
* [Tutorials](http://scikit-learn.org/stable/tutorial/index.html)

View File

@@ -0,0 +1,30 @@
---
title: Spark
---
## Spark
<a href='http://spark.apache.org/' target='_blank' rel='nofollow'>Spark</a> is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
## Core Features
Spark 2.0 has many new features:
* Native CSV data source, based on Databricks spark-csv module
* Off-heap memory management for both caching and runtime execution
* Hive style bucketing support
* Approximate summary statistics using sketches, including approximate quantile, Bloom filter, and count-min sketch.
## How it is used for Data Science
Spark has become a standard tool in many data scientists' toolboxes. With its flexibility in API choices, any programmer can work with Spark in their preferred language. As noted by <a href='https://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists' target='_blank' rel='nofollow'>Cloudera</a>, Spark has gained popularity for many reasons:
* Being Scala-based, Spark embeds in any JVM-based operational system, but can also be used interactively in a REPL in a way that will feel familiar to R and Python users.
* For Java programmers, Scala still presents a learning curve. But at least, any Java library can be used from within Scala.
* Spark's RDD (Resilient Distributed Dataset) abstraction resembles Crunch's PCollection, which has proved a useful abstraction in Hadoop and will already be familiar to Crunch developers. (Crunch can even be used on top of Spark.)
* Spark imitates Scala's collections API and functional style, which is a boon to Java and Scala developers, but also somewhat familiar to developers coming from Python. Scala is also a compelling choice for statistical computing.
* Spark itself, and Scala underneath it, are not specific to machine learning. They provide APIs supporting related tasks, like data access, ETL, and integration. As with Python, the entire data science pipeline can be implemented within this paradigm, not just the model fitting and analysis.
* Code that is implemented in the REPL environment can be used mostly as-is in an operational context.
* Data operations are transparently distributed across the cluster, even as you type.
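To get a feel for the API from Python, here is a minimal PySpark sketch using the Spark 2.x `SparkSession` entry point; it assumes a local `ratings.csv` file with `movieId` and `rating` columns:
```python
from pyspark.sql import SparkSession

# Start a local Spark session (assumes the pyspark package is installed)
spark = SparkSession.builder.master("local[*]").appName("example").getOrCreate()

# Read a CSV into a DataFrame and compute the average rating per movie
df = spark.read.csv("ratings.csv", header=True, inferSchema=True)
df.groupBy("movieId").avg("rating").show(5)

spark.stop()
```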
#### More information
* <a href='https://github.com/apache/spark' target='_blank' rel='nofollow'>Spark Github page</a>
* <a href='https://en.wikipedia.org/wiki/Apache_Spark' target='_blank' rel='nofollow'>Wikipedia</a>

View File

@@ -0,0 +1,23 @@
---
title: TensorFlow
---
## TensorFlow
<a href='https://www.tensorflow.org/' target='_blank' rel='nofollow'>TensorFlow</a> is an open source software library for Machine Intelligence.
With the aim of conducting research in machine intelligence, the Google Brain team developed TensorFlow.
## How it is used for Data Science
TensorFlow is an open source software library for numerical computation using data flow graphs.
Nodes in the graph represent mathematical operations, while the graph edges represent the
multidimensional data arrays (tensors) communicated between them. The flexible architecture
allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile
device with a single API. TensorFlow was originally developed by researchers and engineers
working on the Google Brain Team within Google's Machine Intelligence research organization
for the purposes of conducting machine learning and deep neural networks research, but the
system is general enough to be applicable in a wide variety of other domains as well.
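A minimal sketch of the data flow graph idea, in the TensorFlow 1.x graph-and-session style: operations are first declared as a graph, then run in a session to produce concrete values.
```python
import tensorflow as tf

# Build a tiny data flow graph: two constant tensors and an op that adds them
a = tf.constant(3.0)
b = tf.constant(4.0)
total = a + b

# Running the graph in a session produces the concrete value
with tf.Session() as sess:
    print(sess.run(total))  # 7.0
```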
For more information, visit the <a href='https://github.com/tensorflow' target='_blank' rel='nofollow'>TensorFlow Github page</a>
## Lab
<a href="https://github.com/Microsoft/computerscience/blob/master/Labs/AI%20and%20Machine%20Learning/TensorFlow/TensorFlow.md">TensorFlow</a>

View File

@@ -0,0 +1,9 @@
---
title: Bokeh
---
Bokeh is a Python interactive visualization library that provides an elegant and concise interface for creating plots, dashboards, and data applications.
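A minimal sketch of a Bokeh line plot written to a standalone HTML file (the data and file name are illustrative):
```python
from bokeh.plotting import figure, output_file, show

# Write the plot to a standalone HTML file
output_file("lines.html")

p = figure(title="Simple line example", x_axis_label="x", y_axis_label="y")
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)

show(p)  # opens the HTML file in a browser
```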
### More Information:
[Bokeh Official Website](https://bokeh.pydata.org/en/latest/)
[Bryan Van de Ven, PyBay2016, 55:47](https://www.youtube.com/watch?v=xqwCxuEBpxk)
[Bryan Van de Ven, PyData SF 2016, 2:14:00](https://www.youtube.com/watch?v=M1-MVYLONZc)