Python for Data Science - TDC 2015

In this talk, we introduce the Data Scientist role, differentiate investigative and operational analytics, and demonstrate a complete Data Science process using Python ecosystem tools such as IPython Notebook, Pandas, Matplotlib, NumPy, SciPy and Scikit-learn. We also touch on the use of Python in a Big Data context, with Hadoop and Spark.

  1. 1. PYTHON FOR DATA SCIENCE Gabriel Moreira Machine Learning Engineer @gspmoreira 2015
  2. 2. Why so much buzz?
  3. 3. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
  4. 4. Big Data
  5. 5. ONLINE PERSONALIZATION
  6. 6. WHAT IS DATA SCIENCE http://drewconway.com
  7. 7. WHAT IS A DATA SCIENTIST A Data Scientist is someone with deliberate dual personality who can first build a curious business case defined with a telescopic vision and can then dive deep with microscopic lens to sift through DATA to reach the goal while defining and executing all the intermittent tasks. http://www.datasciencecentral.com/profiles/blogs/are-you-a-data-scientist
  8. 8. http://nirvacana.com/thoughts/becoming-a-data-scientist/ Data Science MetroMap Curriculum
  9. 9. TYPES OF ANALYTICS Investigative Analytics (Consumers: Humans); Operational Analytics (Consumers: Machines) http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/ https://hbr.org/2014/08/the-question-to-ask-before-hiring-a-data-scientist/
  10. 10. [Hilary Mason, Data Scientist] Inquire → Obtain → Scrub → Explore → Model → iNterpret DATA SCIENCE IS IOSEMN
  11. 11. Inquire → Obtain → Scrub → Explore → Model → iNterpret PYTHON IS IOSEMN js Outsider
  12. 12. ANALYTICS CASE: CORPORATE SOCIAL NETWORKS
  13. 13. Full Data Analysis demo available in IPython Notebook bit.ly/python4ds_nb
  14. 14. Investigative Analytics Consumers: Humans
  15. 15. Inquire → Obtain → Scrub → Explore → Model → iNterpret
  16. 16. INQUIRE 1. Which communities are more popular? 2. Is the user engagement increasing? 3. What is the distribution of publishing time? 4. What is the distribution of user interactions? 5. Is there a relationship between publishing hour and number of interactions?
  17. 17. Inquire → Obtain → Scrub → Explore → Model → iNterpret
  18. 18. OBTAIN • Download data from another location (e.g., a web page or server) • Query data from a database (e.g., MySQL or Oracle) • Extract data from an API (e.g., Twitter, Facebook) • Extract data from another file (e.g., an HTML file or spreadsheet) • Generate data yourself (e.g., reading sensors or taking surveys)
  19. 19. READING INTERACTIONS FROM CSV
  20. 20. READING POSTS FROM JSON LINES
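
The code shown on these two slides did not survive in the transcript. As a rough, dependency-free sketch of the same step, the snippet below parses a CSV of interactions and a JSON Lines file of posts using only the standard library. The sample data and every column or field name (`post_id`, `user_id`, `interaction_type`, `community`, `text`) are assumptions made for illustration, not the talk's actual schema.

```python
import csv
import io
import json

# Hypothetical sample data; the column and field names are assumptions.
INTERACTIONS_CSV = """post_id,user_id,interaction_type
1,alice,like
1,bob,comment
2,alice,like
"""

POSTS_JSONL = """{"post_id": 1, "community": "python", "text": "Hello"}
{"post_id": 2, "community": "java", "text": "World"}
"""


def read_csv_rows(text):
    """Parse CSV text with a header row into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_lines(text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


interactions = read_csv_rows(INTERACTIONS_CSV)
posts = read_json_lines(POSTS_JSONL)
print(len(interactions), "interactions,", len(posts), "posts")
```

With Pandas, which the talk uses, `pandas.read_csv` and `pandas.read_json(..., lines=True)` cover the same two cases and return DataFrames directly.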
  21. 21. Inquire → Obtain → Scrub → Explore → Model → iNterpret
  22. 22. SCRUB
  23. 23. SCRUB
  24. 24. SCRUB
  25. 25. SCRUB Dealing with nulls
  26. 26. SCRUB
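
The scrubbing code itself is missing from the transcript, so the following is only a sketch of what "dealing with nulls" might look like. It assumes two rules that are not stated in the talk: a post without an identifier is dropped, and missing optional fields get a default value. The field names are hypothetical.

```python
def scrub_posts(posts):
    """Drop posts without an id and fill missing optional fields with defaults."""
    cleaned = []
    for post in posts:
        if post.get("post_id") is None:
            continue  # a post without an identifier cannot be joined to interactions
        cleaned.append({
            "post_id": post["post_id"],
            # Assumed defaults for missing (None or empty) values
            "community": post.get("community") or "unknown",
            "text": post.get("text") or "",
        })
    return cleaned


# Hypothetical raw records with nulls in different places
raw_posts = [
    {"post_id": 1, "community": "python", "text": "Hello"},
    {"post_id": 2, "community": None, "text": None},
    {"post_id": None, "community": "java", "text": "orphan"},
]
clean_posts = scrub_posts(raw_posts)
print(clean_posts)
```

In Pandas the equivalent operations are `DataFrame.dropna` (with a `subset` of required columns) and `DataFrame.fillna`.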
  27. 27. Inquire → Obtain → Scrub → Explore → Model → iNterpret
  28. 28. 1 - WHICH COMMUNITIES ARE MORE POPULAR?
  29. 29. 1 - WHICH COMMUNITIES ARE MORE POPULAR?
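
One simple way to approach this question is to count posts per community and rank the result. The sketch below does that with `collections.Counter`; measuring popularity by post count is an assumption (the talk may have counted interactions or members instead), and the `community` field name is hypothetical.

```python
from collections import Counter


def community_popularity(posts):
    """Return (community, post_count) pairs, most popular first."""
    return Counter(post["community"] for post in posts).most_common()


# Hypothetical sample: one dict per post
sample_posts = [
    {"community": "python"},
    {"community": "java"},
    {"community": "python"},
    {"community": "ruby"},
    {"community": "java"},
    {"community": "python"},
]
ranking = community_popularity(sample_posts)
print(ranking)
```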
  30. 30. 2 - IS USER ENGAGEMENT INCREASING?
  31. 31. 2 - IS USER ENGAGEMENT INCREASING?
  32. 32. 3 - WHAT IS THE DISTRIBUTION OF PUBLISHING TIME?
  33. 33. 4 - WHAT IS THE DISTRIBUTION OF USER INTERACTIONS?
  34. 34. 4 - WHAT IS THE DISTRIBUTION OF USER INTERACTIONS?
  35. 35. 4 - WHAT IS THE DISTRIBUTION OF USER INTERACTIONS?
  36. 36. 5 - RELATIONSHIP BETWEEN PUBLISHING TIME AND NUMBER OF INTERACTIONS?
  37. 37. 5 - RELATIONSHIP BETWEEN PUBLISHING TIME AND NUMBER OF INTERACTIONS?
  38. 38. 5 - RELATIONSHIP BETWEEN PUBLISHING TIME AND NUMBER OF INTERACTIONS?
  39. 39. 5 - RELATIONSHIP BETWEEN PUBLISHING TIME AND NUMBER OF INTERACTIONS? http://viverdeblog.com/melhoresahorarios-para-postar-nas-redes-sociais/
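
A first look at this relationship can be had by grouping posts by the hour they were published and averaging the number of interactions in each group. The sketch below assumes each post carries an ISO 8601 timestamp and an interaction count; both field names (`published_at`, `interactions`) and the sample values are hypothetical, and the talk's own analysis may have used a different measure.

```python
from collections import defaultdict
from datetime import datetime


def mean_interactions_by_hour(posts):
    """Map each publishing hour (0-23) to the mean interaction count of its posts."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [sum of interactions, post count]
    for post in posts:
        hour = datetime.fromisoformat(post["published_at"]).hour
        totals[hour][0] += post["interactions"]
        totals[hour][1] += 1
    return {hour: total / count for hour, (total, count) in sorted(totals.items())}


# Hypothetical sample posts
sample = [
    {"published_at": "2015-03-02T09:15:00", "interactions": 4},
    {"published_at": "2015-03-03T09:40:00", "interactions": 6},
    {"published_at": "2015-03-03T14:05:00", "interactions": 1},
]
by_hour = mean_interactions_by_hour(sample)
print(by_hour)
```

Plotting the resulting hour-to-mean mapping as a bar chart is a natural next step for spotting the hours that attract more engagement.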
  40. 40. Operational Analytics Consumers: Machines
  41. 41. Inquire → Obtain → Scrub → Explore → Model → iNterpret
  42. 42. Operational Analytics Tasks example: Find Related Posts 1. Discover the most relevant words in the posts 2. Find related posts, with similar content
  43. 43. 1 - RELEVANT WORDS IN A POST TF-IDF: the most “relevant” terms in a document are those that are frequent in that document and rare in other documents
  44. 44. 1 - RELEVANT WORDS IN A POST
  45. 45. 1 - RELEVANT WORDS IN A POST
  46. 46. 1 - RELEVANT WORDS IN A POST
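
To make the TF-IDF idea concrete, here is a minimal pure-Python sketch using the textbook formula: term frequency (count divided by document length) times the logarithm of the number of documents over the number of documents containing the term. This is an illustration under simplifying assumptions (whitespace tokenization, no stop-word removal, no smoothing); libraries such as Scikit-learn apply smoothed and normalized variants, so their numbers will differ. The sample documents are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical mini-corpus
docs = [
    "jmeter jmeter load tests",
    "unit tests in ruby",
    "ruby tests tutorial",
]
weights = tf_idf(docs)
top_term = max(weights[0], key=weights[0].get)
print(top_term, weights[0])
```

In this corpus "tests" appears in every document, so its weight is zero, while "jmeter" is both frequent in the first document and absent from the others, which makes it that document's most relevant term.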
  47. 47. BONUS - GLOBAL RELEVANT TERMS [ALL POSTS]
  48. 48. 2 - SIMILAR POSTS Cosine Similarity: a measure of similarity between two vectors, given by the cosine of the angle between them.
  49. 49. 2 - SIMILAR POSTS
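
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it for sparse vectors stored as dicts. For brevity it uses raw word counts as the vector components, which is an assumption; in the find-related-posts task described here the components would more plausibly be TF-IDF weights. The sample sentences are made up.

```python
import math
from collections import Counter


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def bag_of_words(text):
    """Very rough tokenization: lowercase and split on whitespace."""
    return Counter(text.lower().split())


post_a = bag_of_words("jmeter tests in ruby")
post_b = bag_of_words("performance tests with jmeter")
post_c = bag_of_words("cooking recipes")

print(cosine_similarity(post_a, post_b))
print(cosine_similarity(post_a, post_c))
```

Posts that share no terms score 0, and a post compared with itself scores 1, so ranking all other posts by this value yields the "most similar post" for a given one.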
  50. 50. 2 - SIMILAR POSTS Original post: Did you ever wonder how great it would be if you could write your jmeter tests in ruby? This projects aims to do so. If you use it on your project just let me now. On the Architecture Academy you can read how jmeter can be used to validate your Architecture. modulo 13 arch definition architecture validation | academia de arquitetura
 
 Most similar post (cosine similarity = 0.30), translated from Portuguese:
 Some how-tos related to performance testing have been made available on the Enterprise Architecture site, in the performance Knowledge Base section. Among them: how to define the requirements (throughput, calculating threads for JMeter, etc.), using JMeter, generating test data, and monitoring. planning and executing performance testing | enterprise architecture - how to identify performance acceptance criteria | enterprise architecture - how to geracao de massa de dados | enterprise architecture - how to jmeter | enterprise architecture - how to monitoramento | enterprise architecture
  51. 51. SIMILAR PEOPLE!
  52. 52. Inquire → Obtain → Scrub → Explore → Model → iNterpret
  53. 53. INTERPRET •Drawing conclusions from your data •Evaluating what your results mean •Communicating your result
  54. 54. DATA PRODUCTS “If information has context and the context is interactive, insights are not predictable.” [Agile Data Science, O’Reilly, 2014]
  55. 55. SENTIMENT ANALYSIS bit.ly/eleicoes2014debatesbt Analytical Dashboard
  56. 56. SENTIMENT ANALYSIS Analytical Dashboard bit.ly/eleicoes2014debatesbt
  57. 57. NETWORK ANALYSIS https://linkedjazz.org/network/ js
  58. 58. What about Python for Big Data?
  59. 59. PYTHON ON HADOOP: Streaming, HADOOPY, Pig UDFs in Jython
  60. 60. HADOOP STREAMING Hadoop Streaming allows MapReduce jobs to be written as any executable script, including Python
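
As an illustration of the programming model, the sketch below expresses a word count as a map function and a reduce function and runs them locally, with `sorted` standing in for Hadoop's shuffle-and-sort phase. It is a simplification: in a real Hadoop Streaming job the mapper and reducer are separate executable scripts that read lines from standard input and write tab-separated key/value lines to standard output. The function names here are hypothetical.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reduce_pairs(sorted_pairs):
    """Reducer: sum the counts of each word; input must be sorted by key."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def run_locally(lines):
    """Simulate map -> sort -> reduce in a single process."""
    return dict(reduce_pairs(sorted(map_lines(lines))))


word_counts = run_locally(["Python on Hadoop", "python for data"])
print(word_counts)
```

Because the reducer only relies on its input being sorted by key, the same logic carries over unchanged once the pairs are read from standard input instead of a local list.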

  61. 61. HADOOP STREAMING http://workingsweng.com.br/2014/04/clusterizando-raios-com-hadoop-e-k-means-em-map-reduce/ K-Means with Python on MapReduce: 140,000 lightning strikes on 28/02/2014, in 137 data files. Running on Amazon Elastic MapReduce • Instances: 10 m1.small • Time (k=10): 10 iterations => 32 minutes • Time (k=50): 50 iterations => 164 minutes
  62. 62. IS DATA SCIENTIST THE NEW WEBMASTER?
  63. 63. [Doing Data Science, O’Reilly, 2014]
  64. 64. DATA SCIENCE COURSES • Introduction to Data Science (Univ. of Washington) • Data Science specialization (Johns Hopkins) • Intro to Hadoop and MapReduce (Cloudera) • Machine Learning (Stanford) • Statistical Learning (Stanford) • Mining Massive Datasets (Stanford) • Scalable Machine Learning (Berkeley) http://workingsweng.com.br/2014/04/cursos-mooc-e-especializacoes-em-data-science/
  65. 65. BOOKS
  66. 66. Happy data geeking!
  67. 67. Gabriel Moreira @gspmoreira http://about.me/gspmoreira Thank you! 2015 PYTHON FOR DATA SCIENCE Slides: http://bit.ly/python4ds_tdc