Arc Forumnew | comments | leaders | submitlogin
2 points by akkartik 5403 days ago | link | parent

"Do you mean adding things on top of PLT Scheme to turn it into arc, so that there's no separate "compilation" step?"

Yes.

"I'm unsure to what end this would be."

The reason I ask: I'm running into performance issues with readwarp. I can't have more than 3 users at a time before I start seeing the occasional timeout. At 5 users timeouts become common.

PLT scheme has a GIL and can only run one thread at a time. I think I'm running into that limitation this quickly because arc is a factor of n slower from the rewriting. Caching isn't an option like for HN because it's literally picking one story at a time to render.

So I may have to move reluctantly to a different scheme, or even to common lisp. I'm wondering how much of my house I can carry on my back to the ghetto :)

Even in the context of PLT scheme it seems a valid question to ask why we need this compilation step. Is it buying us enough to justify itself? So far it seems like what we get is nil-termination, optional args and parallel/destructuring assignment. I think a library of macros without those features would still seem mostly like arc to me, and it would be significantly faster.



4 points by shader 5403 days ago | link

You should try and see what subset of the language you actually use, and figure out how hard that would be to implement directly on top of scheme.

I don't really know, but the main thing that I see ac providing is the ability to do arbitrary transformations on the first element of a list before it's passed to scheme. This allows functional position definitions for all kinds of objects, and could allow for lots of different kinds of function dispatch (such as dispatch on second arg, which I think pg was planning)

Also, ac (and arc in general) give us a succinctly defined language whose source fits in only two files of less than 1k lines each. This means that the language is easy to learn about and from, and easy to modify. I don't know how many times I've thought "gee, I wish arc could do something cool, like be able to print the source code for global functions" and just gone to the source and added it.

Mind you, that doesn't preclude macros, and in fact arc already is in some ways defined as a set of macros over scheme. It's just that one of those macros is recursive, and visits all of the elements of the source tree ;)

But if performance is what you're after, defining the minimal palatable subset of arc as macros on scheme may be the way to go. It won't be as nice, but it might be a lot faster.

Why not try it, and see? Arc isn't that big, and at least three implementations have been written already.

-----

2 points by akkartik 5403 days ago | link

"the main thing that I see ac providing is the ability to do arbitrary transformations on the first element of a list.."

Ah yes. Everytime I try to enumerate them I miss one. But this page has the whole list :)

"ac (and arc in general) give us a succinctly defined language whose source fits in only two files of less than 1k lines each."

Oh, I'm extremely sensitive to that, believe me. That's what's awesome about arc, and it's kept me using it full-time for 9 months now. Making it a layer of macros would actually not detract from that IMO.

As I've done more with arc I've had to deal with the underlying platform anyway. So I feel like there's no point even trying to abstract that away too much.

"Why not try it, and see? Arc isn't that big, and at least three implementations have been written already."

Yes. You know, in any other circumstances I would consider it a fun experiment. But since I started working on readwarp full-time I constantly worry that I'm going to spend time and burn cash on what's fun rather than what takes the product forward. I'll prob still end up doing this, but just agonize a lot more before I do.

-----

2 points by shader 5402 days ago | link

Well, if speed is your main requirement, and you have an actual app that you want to run faster, I would examine it and determine a) what subset of arc you use and b) what subset of arc you need. If it's most of arc, then rewriting it in scheme might be pretty hard.

Of course, first determine how much speed you actually need. And as aw said, rewriting in a faster language won't necessarily help you scale much anyway. Finding where you need the extra speed is definitely a prerequisite to figuring out the best way to get it.

Ultimately, we can't really tell you what to do, because we don't know nearly as much as you do about the needs you have for your application, the bottlenecks that are occurring, or how much time it would cost to re-implement your system on top of scheme directly.

Lisp is great for bottom up design, but whether you can get all the way up to your application before reaching arc first is not something that we can't really judge from here. Certainly some subset of arc can be built as mere macros over scheme. But only you can determine whether that subset is comfortable enough to use, and fast enough to solve your problems.

-----

2 points by aw 5402 days ago | link

I don't understand your comment that caching isn't an option because it's picking one story at a time to render. You have a bunch of feeds containing a bunch of stories. Some other background thread / process / program / box can fetch stories, render them, and store the rendered story to disk. Now all your user interface has to do is choose which story to present to the user.

In most programs that are too slow it's a small percentage of the code which is causing the slowness. If you work on making all of your code faster, you're wasting 95% of your effort, because it's only 5% of the code which is causing a problem. So it's worthwhile to run timing tests and figure out what exactly is the slow part, and work on algorithms such as caching to make those parts fast enough.

If you can't think of any algorithms or methods to handle the parts that are too slow, then you can rewrite your program (either the whole program or just the slow parts) in a faster language. But this will only take you so far. If you rewrite your program in a language which is 20 times faster, now you can support 60 users instead of just 3, which is great, but how then do you scale beyond 60 users? At some point you need to figure out what your strategy is for handling lots of users, and rewriting in a faster language is only a temporary fix for that issue.

-----

2 points by akkartik 5402 days ago | link

"If you rewrite your program in a language which is 20 times faster, now you can support 60 users instead of just 3, which is great, but how then do you scale beyond 60 users?"

We're talking about 60 concurrent users. If I could do 60 requests per second I'd be a happy man. 60 requests per second is 216k requests/hr. Assuming 30 page views per happy readwarp user, that's 7200 uniques per hour. Let's say daily uniques at that point is 6 times that, 40k uniques per day. With 40k uniques a day I'd be a happy man. More importantly, 40k uniques will take me n years to get to. So you see, I don't need to "scale beyond 60 users".

Right now readwarp isn't even at 3 requests per second because requests can take several seconds with contention. But that's not ideal either. And it's largely because of the language and the platform.

-----

1 point by akkartik 5402 days ago | link

What you describe in the first paragraph is not caching, it's precomputation. It may help, but it's a completely different performance calculation. The reason caching works for HN is that even if it gets 10k users in 5 minutes it has to generate a dozen pages once that they all see. When I have to manage 100x as many pages in the 'cache' I have to be a lot more careful not to waste time on stories that never get shown.

Precomputation doesn't actually save very much work at all. All it does is move latencies around, often by throwing hardware at the problem. And it can easily go off the rails and end up doing a lot of wasted work without helping latency.

Precomputation does help sometimes. In fact I use it in readwarp in a couple of places. But it's not as straightforward a cost-benefit analysis as caching would be. When caching works it's just brain-dead to reason about. With precomputation I have to constantly monitor whether I'm getting the bang for my buck. And I have to be conservative.

The bigger question is why arc behaves like it takes 3-5s to render one page. All I'm doing is selecting a story and rendering it.

Hmm, I probably haven't fully explored all these ideas. Every now and then my server just hangs for a few seconds and becomes completely unresponsive, and I haven't been able to figure out what it's doing at that point, to get a stack trace or something. For any timing instrumentation I insert, times often vary wildly, by an order of magnitude. I'm starting to suspect PLT's timings shoot through the roof because of contention between threads.

I should just try to offload all my state in a way that can handle multiple coexisting arc interpreters. I'll probably do that. Arc seems quite memory-heavy, though. I'm not sure how many interpreters can coexist on a machine.

You're right that I'll still have to scale in another language, but language features can make my life easier or harder when I have to scale to multiple machines. Having a global interpreter lock perhaps makes it hard to avoid thread contention, and to balance responsiveness with throughput. I know the python folks have had similar issues. OTOH, having just one request per interpreter makes arc far more inefficient in RAM than say mongrel.

-----

1 point by akkartik 5402 days ago | link

"The bigger question is why arc behaves like it takes 3-5s to render one page. All I'm doing is selecting a story and rendering it."

Actually there's 3 stages. Selecting a story (A), rendering it (B), and updating user data structures when the vote on it returns (C). Each user request is seeing the total time for C+A+B.

When I eliminate rendering (the server does all the computations, but doesn't send the story to the client) I still see timeouts with 3 concurrent requests.

---

Ah, I got an improvement. Remember those 'couple of places' I was doing precomputation? I was trying to batch up stage-A computations 5 at a time in a precompute thread when a user's running low on stories. I got rid of them, and the average latency for everyone went up from 1s to 3-4s even with just one concurrent request at a time. But I can now handle 6 concurrent requests without timing out.

When I eliminate stage C the number starts inching up to 10. At 10 median request time is 13s. But at least stuff is getting processed.

My lessons:

1. When the interpreter is single-threaded try to minimize use of threads. Contention gets expensive fast. I am now used to seeing processing latency double when #concurrent requests doubles.

2. When things are slow resist the temptation to precompute. It always makes hot spots harder to see in the future. It's a last resort. And it'll always cost you on a single-threaded platform. I'm now going to rerun my profiling.

3. When you're stuck come blather about your problems on the arc forum. Thanks guys.

-----

1 point by aw 5401 days ago | link

Contention gets expensive fast

Can you provide an example that shows the problem with thread contention?

-----

1 point by akkartik 5401 days ago | link

  (= tab* (table))

  (defop || req
    (repeat (int:arg req "iters")
      (= (tab*:rand) (obj 1 2 3 4))))

  (new-thread asv)
So I can vary the number of assignments each op does in the URL.

Experiment 1: just a hundred sequential requests.

  $ ab -n 100 "http://127.0.0.1:8080/?iters=10"
On my laptop 98% of requests complete in 3ms.

Experiment 2: let's inch up to 2 concurrent requests at a time.

  $ ab -n 100 -c 2 "http://127.0.0.1:8080/?iters=10"
Now 50% of requests require at least 4ms. For 98% it's 10-15ms.

Experiment 3:

  ab -n 100 -c 10 "http://127.0.0.1:8080/?iters=10"
Now 50% of requests require 20-40ms.

In general the 50% latency number that apacheBench returns seems to increase linearly with the number of concurrent requests. This rule of thumb has held in my app as well over several major changes. Right now readwarp requests take 800ms on localhost and 2-3s from the internet at large. Not much room for those latencies to grow linearly.

-----

1 point by aw 5401 days ago | link

I can't compare your first and last example because one says 98% and the other 50%.

If it were an apples to apples comparison would you agree that it wouldn't be surprising that if the server were handling 10x the number of simultaneous requests that each request would take 10x as long?

The only apples to apples comparison is going from one simultaneous request to two simultaneous requests which has the increase from 3ms to 10-15ms, which seems like about double what we'd expect if using threads had no overhead at all.

-----

1 point by akkartik 5401 days ago | link

When I give just the 98% number I meant the 50% number stays the same (+/- 1ms). We're at the limits of ab's time resolution here.

Yes, none of this is surprising given that PLT is single-threaded. I'm trying to remember my class in queuing theory.. If each request takes m ms to process, n concurrent requests will take mn ms to process, (n-1)m ms of queuing delay and m ms of processing time. The average latency seen would be mn/2 which scales by n. As the number of concurrent requests grows, the curve of latency to number of concurrent requests would be linear.

If you now add a second queue, the curve stays flat until 2 requests and then moves linearly up. Each queue is only n/2 long, so average queuing delay is (n-1)m/4 ms. With q queues, on average there's only (n-1)m/2q ms of queuing delay.

Summary: if you can have multiple truly-concurrent threads, you'll be able to overlap that many concurrent requests without any increase in latency, and the slope of the curve past that point will be inversely proportional to the number of threads. Does this make sense?

---

After I wrote my previous comment I started looking in greater depth into why I'm taking 800ms to process a request. I'm really not doing that much. I'll post an update with what I find. Perhaps I haven't spend enough time attacking m before looking to expand q.

-----

1 point by aw 5401 days ago | link

When you say that PLT is "single-threaded", is what you're referring to is that PLT only uses one CPU?

Well, yes, if you have 10 CPU's then you can handle 10 simultaneous requests in the same elapsed time as 1 CPU can handle 1 request. If you don't have any overhead in communication between CPU's, locks, etc.

I think it's clearer to talk about having one CPU or multiple CPU's instead of "truly-concurrent threads" or "thread contention".

You can have multiple threads interleaving on a CPU, and in some systems that support more than one CPU, a thread can run for a while on one CPU and then get moved to running on a different, less busy CPU.

Now if you said that that on one CPU 100 requests one at a time took a total of 300ms to process but 100 requests running interleaved on one CPU took a total of 800ms to process, then we would have evidence that there's too much overhead in thread context switching or some other problem.

-----

1 point by akkartik 5399 days ago | link

I think the quality of the thread implementation has a role to play. When a process waits on IO the OS will find something else to do. I don't know if user-level threads can be that smart, and I don't know if PLT threads are that smart.

Without casting aspersions, it's not just about number of processors. Very few programs avoid IO altogether, and if there's IO some programs will be more clever than others in moving work around.

-----

1 point by aw 5399 days ago | link

In a properly working thread system, all threads that aren't waiting for something should be given a fair slice of the available CPU time. In particular, no thread that is ready to run should be blocked or delayed because some other thread is waiting on I/O.

If you are wondering if PLT threads work correctly, isn't this rather easy to test? Do an A/B test with some threads that do some computation with and without some other threads waiting on I/O and see if the runtimes are the same.

-----

1 point by rfindler 5400 days ago | link

It seems like things are straightened out here, but I thought I'd point out that the upcoming version of PLT allow multi-core parallelism via 'future':

http://pre.plt-scheme.org/docs/html/guide/performance.html?q...

Also: PLT does not have a GIL, but (futures aside) it has a similar mechanism that works better for implementing continuations (and worse for interoperating with C code)

-----

1 point by akkartik 5400 days ago | link

Thanks for the tip and the correction. Can you elaborate on how PLT's atomicity interoperates poorly with C code?

-----