For one of my current pet projects I need to parse a potentially huge amount of JSON data. The obvious solution in such a case is to consume and parse the input as a stream. This blog post describes the approach I used to solve the task.

Since I had no experience with parsing JSON streams in Haskell, I started by googling, and found almost nothing. There are a couple of blog posts on the Internet about the topic, but some of them I couldn't use (maybe because I'm not that smart), and the others are outdated and/or don't actually work.

So, I decided to go with json-stream - an applicative incremental JSON parser for Haskell. I also took some inspiration from this post on Stack Overflow: Why doesn't print force entire lazy IO value?.

Below is a working example of parsing a JSON stream from the Shopify API.

For this blog post, I took the Shopify API as an example to play with. You can find more details on the API in the Shopify dev section.

First, let's define the customer and customer address data types:

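{-# LANGUAGE DuplicateRecordFields #-}
-- customerExtId, firstName and lastName appear in both records below,
-- which only compiles in a single module with DuplicateRecordFields enabled.
module Models ( Customer(..), CustomerAddress(..)) where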

import qualified Data.Text as T

data CustomerAddress =
  CustomerAddress
  { addressExtId :: !Int
  , customerExtId :: !Int
  , firstName :: !T.Text
  , lastName :: !T.Text
  , companyName :: !T.Text
  , address1 :: !T.Text
  , address2 :: !T.Text
  , city :: !T.Text
  , province :: !T.Text
  , provinceCode :: !T.Text
  , country :: !T.Text
  , countryCode :: !T.Text
  , countryName :: !T.Text
  , zipCode :: !T.Text
  , phone :: !T.Text
  , customerName :: !T.Text
  } deriving (Eq, Show)

data Customer =
  Customer
  { customerExtId :: !Int
  , email :: !(Maybe T.Text)
  , createdAt :: !T.Text
  , updatedAt :: !T.Text
  , firstName :: !T.Text
  , lastName :: !T.Text
  , customerState :: !T.Text
  , note :: !(Maybe T.Text)
  , addresses :: ![CustomerAddress]
  } deriving (Eq, Show)

Now, let's write a parser that consumes data and produces the customer address object:

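{-# LANGUAGE OverloadedStrings #-}
-- OverloadedStrings lets the string literals below serve as Text keys and defaults.
import Control.Applicative (many)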
import qualified Data.Text as T
import qualified Data.JsonStream.Parser as J
import Data.JsonStream.Parser ((.:), (.:?), (.|))
import qualified Models as M

addressParser :: J.Parser M.CustomerAddress
addressParser =
  M.CustomerAddress
    <$> "id" .: J.integer
    <*> "customer_id" .: J.integer
    <*> "first_name" .: J.string .| ""
    <*> "last_name" .: J.string .| ""
    <*> "company" .: J.string .| ""
    <*> "address1" .: J.string
    <*> "address2" .: J.string .| ""
    <*> "city" .: J.string
    <*> "province" .: J.string
    <*> "province_code" .: J.string
    <*> "country" .: J.string
    <*> "country_code" .: J.string
    <*> "country_name" .: J.string
    <*> "zip" .: J.string
    <*> "phone" .: J.string .| ""
    <*> "name" .: J.string
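
To see what the three field combinators do without touching the network, here is a minimal sketch that runs a tiny parser over an in-memory document with parseByteString. The Person type, its field names and the sample JSON are made up for illustration and are not part of the Shopify model. As far as I understand json-stream, .: requires the key, .| supplies a default when nothing matched, and .:? turns a missing or null value into Nothing.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.JsonStream.Parser as J
import Data.JsonStream.Parser ((.:), (.:?), (.|))

-- Hypothetical record, used only for this illustration.
data Person = Person
  { personId :: !Int
  , nickname :: !T.Text
  , email    :: !(Maybe T.Text)
  } deriving (Eq, Show)

personParser :: J.Parser Person
personParser =
  Person
    <$> "id" .: J.integer             -- required key
    <*> "nickname" .: J.string .| ""  -- falls back to "" when the key is absent
    <*> "email" .:? J.string          -- Nothing when the key is absent or null

-- Sample document: only "id" is present.
samplePerson :: B.ByteString
samplePerson = "{\"id\": 7}"

-- parseByteString collects every value the parser yields into a list.
demoPeople :: [Person]
demoPeople = J.parseByteString personParser samplePerson
```

Loading this in GHCi and evaluating demoPeople is a quick way to check how defaults and optional fields behave before wiring the parser to real data.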

Next, we write a parser for the customer object:

customerParser :: J.Parser M.Customer
customerParser =
  M.Customer
    <$> "id" .: J.integer
    <*> "email" .:? J.string
    <*> "created_at" .: J.string
    <*> "updated_at" .: J.string
    <*> "first_name" .: J.string .| ""
    <*> "last_name" .: J.string .| ""
    <*> "state" .: J.string
    <*> "note" .:? J.string
    <*> many ("addresses" .: J.arrayOf addressParser)

And now let's write a parser that consumes the input data and uses both parsers to produce the customer objects:

customersParser :: J.Parser M.Customer
customersParser =
  J.objectWithKey "customers" $ J.arrayOf customerParser
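
The same shape (an object key holding an array of records, each with a nested array) can be tried on a small in-memory document. This is only a sketch under my own assumptions: the "items", "name" and "tags" keys and the sample data are invented for the example. It shows objectWithKey and arrayOf yielding one value per array element, and many gathering a nested array into a list.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (many)
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.JsonStream.Parser as J
import Data.JsonStream.Parser ((.:))

-- One element of the outer array: a name plus all of its tags.
itemParser :: J.Parser (T.Text, [T.Text])
itemParser =
  (,) <$> "name" .: J.string
      <*> many ("tags" .: J.arrayOf J.string)  -- collect the nested array into a list

-- Yields one (name, tags) pair per element of the "items" array.
itemsParser :: J.Parser (T.Text, [T.Text])
itemsParser = J.objectWithKey "items" (J.arrayOf itemParser)

-- Hypothetical sample document.
sampleItems :: B.ByteString
sampleItems =
  "{\"items\": [{\"name\": \"a\", \"tags\": [\"x\", \"y\"]}, {\"name\": \"b\", \"tags\": []}]}"

demoItems :: [(T.Text, [T.Text])]
demoItems = J.parseByteString itemsParser sampleItems
```

Note that the outer parser has type J.Parser of a single item, not of a list: each array element is yielded on its own, which is what makes it possible to process a huge array in constant memory.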

So, we have written the parser, and now it's time to pull some data from the Internet and feed it to the parser. For simplicity, I use a fake API URL here and omit authorization - please refer to the API documentation for the relevant details. This function opens a new connection to the API endpoint and feeds the content to the parseWith function, which parses the input stream and counts the number of parsed objects.

import qualified Network.HTTP.Client as H2
import qualified Network.HTTP.Client.TLS as H2
import qualified Data.ByteString as B

someFunc :: IO ()
someFunc = do
    manager <- H2.newManager H2.tlsManagerSettings
    request <- H2.parseRequest "https://my-haskell-shop.myshopify.com/admin/api/2021-01/customers.json?limit=50"
    H2.withResponse request manager $ \response -> do
      putStrLn "The status code was: "
      print (H2.responseStatus response)
      chunk <- H2.responseBody response
      cnt <- parseWith (H2.responseBody response) customersParser chunk
      putStrLn ("parsed " ++ show cnt ++ " customers")

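-- Drives the incremental parser: prints every parsed customer, pulls the next
-- chunk from `refill` when the parser needs more input, and returns the count.
parseWith :: IO B.ByteString -> J.Parser M.Customer -> B.ByteString -> IO Int
parseWith refill scheme inp = do
    let pout = J.runParser' scheme inp
    doparse pout 0
    where
      doparse (J.ParseDone _) cnt = return cnt
      doparse (J.ParseFailed err) cnt = do
          putStrLn ("parse failed: " ++ err)
          return cnt
      doparse (J.ParseYield v next) cnt = print v >> doparse next (cnt + 1)
      doparse (J.ParseNeedData cont) cnt = do
          dta <- refill
          -- http-client signals the end of the body with an empty chunk
          if B.null dta
            then return cnt
            else doparse (cont dta) cnt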
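
The loop in parseWith can also be exercised without any network access. Below is a rough, pure sketch of the same state machine that takes its chunks from a list instead of an HTTP body. The feedChunks helper, its error message and the sample chunks are my own names and data for this illustration, not something json-stream provides.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.JsonStream.Parser as J

-- Hypothetical helper: feeds the chunks to the parser one by one and returns
-- the parsed values together with an error message, if there was one.
feedChunks :: J.Parser a -> [B.ByteString] -> ([a], Maybe String)
feedChunks parser = go (J.runParser parser)
  where
    go (J.ParseYield v next) chunks =
      let (vs, err) = go next chunks in (v : vs, err)
    go (J.ParseDone _) _ = ([], Nothing)
    go (J.ParseFailed err) _ = ([], Just err)
    go (J.ParseNeedData cont) (c : cs) = go (cont c) cs
    -- The parser wants more input but the chunk list is exhausted.
    go (J.ParseNeedData _) [] = ([], Just "input ended before the document was complete")

-- The array is deliberately cut into pieces at arbitrary positions.
demoChunks :: ([Int], Maybe String)
demoChunks = feedChunks (J.arrayOf J.integer) ["[1, 2", ", 3, 4", ", 5]"]
```

The last case of go plays the same role as the empty-chunk check on a real response body: without it, a truncated document would keep the loop asking for data that never arrives.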

For this example I used stack and LTS 17.7. Relevant dependencies are:

  aeson >=1.5.6.0
  , base >=4.7 && <5
  , bytestring >=0.10.12.0
  , http-client >=0.6.4.1
  , http-client-tls >=0.3.5.3
  , json-stream >=0.4.2.4
  , text >=1.2.4.1

This approach worked pretty well for me.