Open data presentation on tools and automation

This is a short presentation about how to make your data lovely, including how to prioritise, clean, extract, transform and automate data publishing in your organisation.

  1. Making your data lovely!
     Prioritising, cleaning, extraction, transformation, automation
     Pia Waugh, Director of Gov 2.0 and Data, Department of Finance (soon to be Prime Minister & Cabinet)
  2. Key Benefits to the Public Service in Opening Data
     • Efficiencies from proactively publishing common requests
     • Cheaper and more modular services delivery
     • Reduced regulatory burden through machine readable data supporting compliance and automated reporting
     • Better policy outcomes by leveraging cross-agency data
     • More consistency & less duplication across government
     • Improved opportunities to leverage innovation and collaboration (citizens, industry, other depts)
     • Opportunities to improve data quality through verifiable public contributions
  3. Tips for ensuring benefits realisation of open data
     • Adopt an approach of “data user and developer empathy”
     • Data publishing built into your BAU
     • Initial focus on data that supports you, to build capability
     • Consume your own data APIs (apps, datavis, BI, etc)
     • Ensure you consider:
       • Quality – no one can use bad data, but perfect is the enemy of the good
       • Currency – is it up to date? How often is it updated?
       • APIs – is it programmatically available?
       • Publishing – have you provided supporting materials (taxonomies)?
       • Discoverability – is it hosted or linked on data.gov.au?
       • Reusability – have you tested it with data users?
       • Licensing – Creative Commons By Attribution is the default
     • Automation wherever possible!
  4. Data on the inside
     • Do you know what data you have internally?
     • Are you considering all data types?
     • How embedded is data driven decision making?
     • How can you upskill the whole organisation?
     • Do you know what your external data needs are?
     • How are you measuring and monitoring success?
     Data infrastructure to support your organisation should be extendable to support sharing/publishing
  5. Rub a dub data
     • If a machine can’t read it, a machine can’t make an API
     • Some data has specialised data formats, some commonalities
     • Tabular, spatial, real time, unstructured, etc
     • Most data comes from somewhere, use the source, Luke!
     • Machines and humans have different needs
  6. What you need is clean sheets
     • Don’t merge cells. Sorting and other manipulations people may want to apply to your data assume that each cell belongs to one row and column.
     • Don’t mix data and metadata (e.g. date of release, name of author) in the same sheet.
     • The first row of a data sheet should contain column headers. None of these headers should be duplicates or blank. The column header should clearly indicate which units are used in that column, where this makes sense.
     • The remaining rows should contain data, one datum per row. Don’t include aggregate statistics such as TOTAL or AVERAGE. You can put aggregate statistics in a separate sheet, if they are important.
     • Numbers in cells should just be numbers. Don’t put commas in them, or stars after them, or anything else. If you need to add an annotation to some rows, use a separate column.
     • Use standard identifiers: e.g. identify countries using ISO 3166 codes rather than names.
     • Don’t use only colour or other stylistic cues to encode information. If you want to colour cells according to their value, use conditional formatting.
     • Leave the cell blank if a value is not available.
     • If you provide pivot tables, make sure the underlying data is available separately too.
     • If you also want to create a human-friendly presentation of the data, do so by creating another sheet in the same workbook and referencing the appropriate cells in the canonical data sheet.
     http://www.clean-sheet.org/
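
A few of the clean-sheet rules above can be checked by a machine before publishing. The following is a minimal sketch, not part of the original slides: it assumes the sheet has been exported to CSV, and the function name, the list of aggregate labels and the "decorated number" pattern are all illustrative assumptions. It only covers header, aggregate-row and number-formatting checks.

```python
import csv
import io
import re

# Labels that usually mark an aggregate row; this list is an assumption.
AGGREGATE_LABELS = {"total", "average", "sum", "mean"}

# A number with thousands separators ("1,200") or trailing stars ("35*").
DECORATED_NUMBER = re.compile(r"^\d{1,3}(,\d{3})+(\.\d+)?\*?$|^\d+(\.\d+)?\*+$")


def check_clean_sheet(csv_text):
    """Return a list of human-readable problems found in a CSV string."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["sheet is empty"]
    problems = []
    # Rule: the first row holds headers, none blank, none duplicated.
    headers = [h.strip() for h in rows[0]]
    if any(h == "" for h in headers):
        problems.append("blank column header")
    if len(set(headers)) != len(headers):
        problems.append("duplicate column header")
    for line_no, row in enumerate(rows[1:], start=2):
        # Rule: no aggregate rows such as TOTAL or AVERAGE in the data sheet.
        if any(cell.strip().lower() in AGGREGATE_LABELS for cell in row):
            problems.append(f"row {line_no}: aggregate row")
        # Rule: numbers should just be numbers (no commas, no stars).
        for cell in row:
            if DECORATED_NUMBER.match(cell.strip()):
                problems.append(f"row {line_no}: decorated number {cell.strip()!r}")
    return problems
```

A blank cell passes these checks, which matches the rule that missing values should simply be left blank.
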
  7. Automate your reporting
     http://ckan.org/2015/09/18/pyramids-pipelines-and-a-can-of-sweave-ckan-asia-pacific-meetup/
  8. Automating updates
     Automation involves system to system updates to save you time & money.
     Three broad approaches:
     1. Write scripts to push or pull data updates using an API directly from the source. Usually doesn’t require much data manipulation.
     2. Adopt a tool like Taverna, FME or Splunk to extract, clean/manipulate, and then push data to the data.gov.au (CKAN/geoserver) API directly.
     3. Use data.gov.au (CKAN) to schedule pull updates from your data. Most agencies don’t do this, as they prefer to push updates.
     The data.gov.au team strongly encourage you to gain at least one geek in your data team so you can experiment with code and tools to best meet your needs.
     “With much help and encouragement from the support team at data.gov.au, we dipped our toes into the CKAN API waters. As a DotNet shop we were keen to limit the technology landscape and sought to automate the upload using DotNet. The CKAN API is refreshingly lightweight with a simple authentication process and messaging.” -- ABN Lookup Team
     Code at https://github.com/datagovau/ckan-api-examples
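
The first approach (a script that pushes an update through the API) might look roughly like the sketch below. This is not the ABN Lookup Team's code or the examples in the repository above. It assumes the portal exposes the CKAN Action API with the resource_patch action and accepts an API key in the Authorization header, so check this against the CKAN version your portal runs. The portal URL, key and resource id in the usage note are placeholders.

```python
import json
import urllib.request


def build_resource_patch(ckan_url, api_key, resource_id, fields):
    """Build (but do not send) a CKAN Action API request that patches a resource."""
    # The resource id travels in the JSON body together with the changed fields.
    payload = dict(fields, id=resource_id)
    return urllib.request.Request(
        url=ckan_url.rstrip("/") + "/api/3/action/resource_patch",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )


def send(request):
    """Send the request and return CKAN's decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Usage, with placeholder values: request = build_resource_patch("https://data.example.org", "my-api-key", "my-resource-id", {"url": "https://example.org/latest.csv"}), then send(request) from a scheduled job. Building and sending are kept separate so the request can be inspected or logged without touching the portal.
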
  9. Support
     • http://toolkit.data.gov.au is updated regularly. Recent updates include:
       • How to automate data updates to data.gov.au with FME
       • Improved information on how to clean data
       • How to manage your own catalogue harvesting
       • Government data landscape to identify projects of use
     • Open Data Community Forum – soon to be moved to analyticsspace
     • Talk to your colleagues across government(s)
     • Other sources
       • Communities of interest: Data Science Meetup groups, Data Analytics Centre of Excellence, Linked Data Working Group, National Statistical Service, etc
       • GovHack Developers Kit: Become a data scientist in an hour, data tools, APIs, datavis, spatial, mashup techniques, statistical
  10. Quality – improve over time
      The 5 Star Data Quality standard developed by Sir Tim Berners-Lee will be used on data.gov.au in the coming month or two to indicate data quality.
      Aim for quality web services. API quality will also be looked at soon, including potentially a 5 star API standard.
      http://5stardata.info/en/
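
To get a feel for the 5 star scheme, the sketch below estimates a rating from a file format. It is a rough heuristic under stated assumptions, not how data.gov.au calculates its rating: the format lists are illustrative, and the function name and parameters are hypothetical. The star levels follow the public description at 5stardata.info (open licence, structured, non-proprietary format, URIs, linked to other data).

```python
# Rough format-to-star mapping; the format lists are illustrative assumptions.
STARS_BY_FORMAT = {
    "pdf": 1, "doc": 1, "png": 1,      # on the web, any format
    "xls": 2, "xlsx": 2,               # structured, machine readable
    "csv": 3, "json": 3, "xml": 3,     # open, non-proprietary format
    "rdf": 4, "ttl": 4, "jsonld": 4,   # uses URIs to denote things
}


def star_rating(file_format, openly_licensed, links_to_other_data=False):
    """Estimate a 5 star open data rating from the format and licence."""
    if not openly_licensed:
        return 0  # without an open licence the data earns no stars
    # Unknown formats are treated as one star (an assumption).
    stars = STARS_BY_FORMAT.get(file_format.lower().lstrip("."), 1)
    if stars == 4 and links_to_other_data:
        return 5  # linked to other data to provide context
    return stars
```
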
  11. Data integration and aggregation
      • Challenging but great potential for improved policy/services.
      • Unit record sharing is complex, with privacy concerns for personal data.
      • Personal unit record data is mostly useful to researchers; access to such data needs appropriate mechanisms with legal, technical and ethical constraints.
      • Data aggregated by common spatial boundaries is comparable across datasets and over time.
      • Unfortunately, data owners traditionally aggregate to boundaries that constantly change (electorates, postcodes, etc).
      • The Australian Statistical Geography Standard (ASGS) provides a consistent set of spatial boundaries that can be mapped to other needs.
      • Anonymisation on the fly APIs also provide a mechanism for appropriate public/agency access to unit record level data (e.g. ABS.Stat)
      http://statistical-data-integration.govspace.gov.au/
      https://toolkit.data.gov.au/index.php?title=Definitions#Types_of_data
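
The point about common spatial boundaries can be sketched in a few lines: once two datasets are aggregated to the same region codes and periods, they can be compared directly. This is a minimal illustration, not an anonymisation method. The field names and the region codes "R1" and "R2" in the usage note are hypothetical placeholders, not real ASGS codes.

```python
from collections import defaultdict


def aggregate(records, region_field="region_code", period_field="year",
              value_field="value"):
    """Sum unit records into one total per (region, period) pair."""
    totals = defaultdict(int)
    for record in records:
        totals[(record[region_field], record[period_field])] += record[value_field]
    return dict(totals)


def join_on_region(left, right):
    """Pair the totals of two aggregated datasets that share a (region, period) key."""
    return {key: (left[key], right[key]) for key in left.keys() & right.keys()}
```

Usage: aggregate two lists of records, each record a dict such as {"region_code": "R1", "year": 2014, "value": 3}, then pass both results to join_on_region. Only regions and periods present in both datasets are returned, which is why a shared, stable boundary set matters.
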
  12. data.gov.au
      Free, cloud, scalable API enabled platform for hosting government data.
      Staged approach
      1. Publishing (2013 – mid 2014): Improving the functionality and ease of publishing for agencies with training and documentation
      2. Value realisation (2014-2015): Providing useful front end tools for data.gov.au including data visualisation and analysis tools. Publishing quality data a pre-requisite.
      3. Data quality (2014-2015): Looking at ways to provide agencies the ability to accept iterative data improvements in a verifiable way
      Features
      • Support for tabular, spatial and data models
      • Options for hosting, linking or catalogue harvesting
      • Manual and automated publishing options
      • API access to government data
      • Easy to publish, download & interact
      • Use cases and site|data|org analytics
      • Data Request Site
      • Metadata harvesting from gov data gateways
      • National Map integration
      • Federated search for discoverability
      In Planning
      • 5 star quality plugin
      • Selective crowdsourcing for updates
      • League Table
  13. Open Data Portals
      Council Portals:
      • City of Melbourne
      • City of Brisbane
  14. Some Case Studies
      • Publishing Budget 2014 Data Report
      • Open data – Transforming the Provider / Stakeholder Paradigm
      • On the Value of Open Roof Prints
      • 100 years of patent and IP data released on data.gov.au
      More available along with tech support at http://toolkit.data.gov.au
      Other Australian case studies/documentation
      • SA Open Data Toolkit
      • QLD Government Case Studies
      • Victorian Government Showcase
      • NSW Apps Showcase
      • ACT examples
  15. The future is here.... And it is already widely distributed
      http://www.flickr.com/photos/mr_matt/3568892622/
      Challenge #1: Collaborate
      Challenge #2: Share
      Challenge #3: Measure
      Challenge #4: Play
      Questions?
      @piawaugh @datagovau data.gov.au toolkit.data.gov.au